Identification and use of conserved noncoding sequences

ABSTRACT

The invention is directed to the identification of conserved noncoding sequences in various organisms and the use of those sequences as primers for use in amplifying, mapping and analysis of DNA among different species.

RELATED APPLICATION

This application claims priority to U.S. Provisional Application Ser. No. 60/303,356 filed Jul. 6, 2001, which is hereby incorporated by reference in its entirety.

This invention was made with Government support under Grant (Contract) No. R01GM42610 awarded by the National Institutes of Health. The government has certain rights in this invention.

FIELD OF THE INVENTION

This invention is in the field of molecular biology. In particular this invention relates to the identification and use of conserved noncoding sequences (CNS) for molecular and genetic analysis.

BACKGROUND OF THE INVENTION

While much progress has been made in recent years in traditional molecular and genetic mapping, sequencing of genomes and molecular analysis of gene expression, there is still a tremendous need to develop improved techniques for molecular and genetic analysis within and between species.

Molecular markers are common tools that can reveal polymorphism directly at the DNA level and are used for genetic resource assessment, molecular analysis and genetic mapping. Various types of markers have been developed: RFLP: restriction fragment length polymorphism; PCR: polymerase chain reaction based markers; SCAR: sequence characterized amplified region; SSR: simple sequence repeats (microsatellites); ISSR: intersimple sequence repeats; STS: sequence tagged sites and AFLP: amplified fragment length polymorphisms. Although these methods are powerful, they are useful only within one species or genus because the markers are not from genes shared by larger taxonomic groups. There is thus a need in the art to develop improved methods of genetic mapping and molecular analysis within and across different kingdoms.

SUMMARY OF THE INVENTION

In order to meet these needs, the present invention is directed to CNSs (conserved noncoding sequences) and their use in molecular analysis and genetic mapping in any organism. The conserved noncoding sequences (CNSs) of the invention find many applications in molecular analysis and genetic mapping. In particular, CNSs find use in identifying polymorphisms between two or more DNA sequences. CNSs find further use in mapping including cross-species alignment of DNA sequences and alignment of syntenic regions outside of coding regions.

CNSs may also be utilized in gene cloning including capturing or isolating of sets of related genes.

CNSs may also be utilized for analyzing promoter sequences including (a) identifying functional cis-acting sequences within a promoter; (b) identifying binding sites for transcription factor complexes; (c) isolating or physically capturing DNA binding proteins/transcription factors; (d) linking binding sites to appropriate transcription factors via gene expression profiles; (e) designing promoters with novel expression characteristics and (f) titrating out transcription factors via replication of CNS-containing DNA.

The CNSs of the invention can also be utilized to alter the expression level and specificity of endogenous genes by in situ modification of CNSs.

In one format of the invention, combinations of CNSs or CNS-exon primer combinations are used to generate PCR fragments including highly polymorphic regions and some exon regions. Such a fragment, once shown to be linked with a characteristic of interest in any species (e.g., any grass), can be sequenced. The sequence, in turn, is used to anchor the linked gene to the nearest species genetic map.

The present invention is further directed to methods of genetic mapping using the CNSs of the invention. The present invention is further directed to kits containing various CNSs for use in mapping such CNSs.

The present invention is further directed to a method of identifying a polymorphism between two or more DNA sequences by a) selecting a first and a second primer for use in an amplification reaction wherein the first or second primer is a CNS in the two or more DNA sequences; b) hybridizing the first and second primers to the two or more DNA sequences to form primer-DNA hybrids; c) amplifying the primer-DNA hybrids to form amplified DNA products; and d) separating the DNA products by size to identify a polymorphism between said two or more DNA sequences.
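By way of illustration only, steps a) through d) can be mimicked computationally. The following minimal Python sketch assumes exact primer matching on a single strand and uses hypothetical primer and template sequences (FORWARD, REVERSE, sample_a and sample_b are invented for the example and do not appear in this specification). It predicts the size of the product bounded by the two primers in each sequence and reports a size polymorphism when the predicted sizes differ.

```python
from typing import Optional

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def amplicon_length(template: str, forward: str, reverse: str) -> Optional[int]:
    """Predict the product size bounded by a forward and a reverse primer.

    Assumes perfect primer matches; returns None if either primer site is absent.
    """
    start = template.find(forward)
    if start < 0:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end < 0:
        return None
    return end + len(reverse_site) - start


# Hypothetical primers: FORWARD stands in for a CNS, REVERSE for an exon site.
FORWARD = "TTATTACATGCG"
REVERSE = "GGATCCTTGACA"

# Hypothetical templates that differ by a 5-nucleotide insertion between the primer sites.
sample_a = "GG" + FORWARD + "ACGTACGTAC" + reverse_complement(REVERSE) + "TT"
sample_b = "GG" + FORWARD + "ACGTACGTACGATTA" + reverse_complement(REVERSE) + "TT"

sizes = {
    "sample_a": amplicon_length(sample_a, FORWARD, REVERSE),
    "sample_b": amplicon_length(sample_b, FORWARD, REVERSE),
}
# Distinct predicted sizes correspond to bands that separate by size in step d).
is_polymorphic = len(set(sizes.values())) > 1
```

In practice, primer annealing tolerates mismatches and products are resolved on a gel, so the sketch only approximates the size comparison of step d).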

The DNA sequences may be from any organism including plants.

In one format of the method of the invention, the polymorphism is utilized for mapping.

In the method of detecting a polymorphism of the invention, the first primer may be a CNS and the second primer may be an exon sequence. Alternatively, both primers may be CNSs.

In the method of detecting a polymorphism of the invention the DNA products may be digested with a restriction endonuclease prior to separation in step d). After digestion, the DNA products may be separated by electrophoresis.
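As a hedged illustration of how digestion can reveal a polymorphism that size alone does not, the following Python sketch cuts two equal-length hypothetical products at the Msp I recognition site (C^CGG, cut after the first C) and compares the resulting fragment sizes. The product sequences are invented for the example, and the function name `digest` is an assumption of this sketch.

```python
def digest(seq: str, site: str = "CCGG", cut_offset: int = 1) -> list:
    """Cut a linear sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site left on the upstream fragment
    (1 for Msp I, which cuts C^CGG).
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# Two hypothetical products of identical length; product_b has lost the site (G to A change).
product_a = "AAAACCGGTTTTTT"
product_b = "AAAACCAGTTTTTT"

fragment_sizes_a = sorted(len(f) for f in digest(product_a))
fragment_sizes_b = sorted(len(f) for f in digest(product_b))
```

Here the undigested products would co-migrate, while the digested products yield different fragment patterns.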

In one format of the invention, the amplification primers are PCR primers.

The primer-DNA hybrids may be amplified using any amplification technique. Such amplification techniques include cloning, PCR, LCR, TAS, 3SR, NASBA and Qβ amplification.

The present invention is further directed to a method of mapping a polymorphism between two or more species by a) selecting a first and a second primer for use in an amplification reaction wherein the first or second primer is a CNS in the genomes of the two or more species wherein the genomes include genomic DNA; b) hybridizing the first and second primers to the genomic DNA to form primer-genomic DNA hybrids; c) amplifying the primer-genomic DNA hybrids to form amplified DNA products; and d) separating the DNA products by size to map a polymorphism between the two or more DNA sequences.

The present invention is further directed to kits including the CNSs of the invention where the kits find use in detecting polymorphisms and mapping. The kits may include all the components necessary to carry out an amplification reaction utilizing the CNSs. Such components may include primers, enzymes, directions for use, etc.

The present invention is further directed to a method of identifying cis-acting genomic DNA sequences that alter gene expression in plants by a) obtaining the genomic DNA sequences of orthologous genes from two or more related plants; b) comparing the non-coding regions of the genomic DNA sequences to identify conserved non-coding sequences at least 7 nucleotides in length and c) testing the CNS in a plant to determine whether the CNS alters gene expression in the plant to thereby identify the CNS as a cis-acting genomic DNA sequence that alters gene expression in plants.

In the method of identifying cis-acting genomic DNA sequences that alter gene expression in plants, the genomic DNA sequence of the orthologous genes may be obtained by searching DNA sequence databases or by directly sequencing the genomic DNA.

In the method of identifying cis-acting genomic DNA sequences that alter gene expression of the invention the testing of the CNS in a plant may include the steps of: i) creating a CNS-reporter gene construct; ii) introducing the CNS-reporter gene construct into a plant to create a transgenic plant and iii) measuring expression levels of the reporter gene in the transgenic plant.

The present invention is further directed to a method of identifying a binding site for a transcription factor in a plant by a) obtaining the genomic DNA sequences of orthologous genes from two or more related plants; b) comparing the non-coding regions of the genomic DNA sequences to identify conserved non-coding sequences at least 7 nucleotides in length and c) testing the CNS to determine whether the CNS is a binding site for a transcription factor complex.

In the method of identifying a binding site for a transcription factor of the invention, the genomic DNA sequence of the orthologous genes may be obtained by searching DNA sequence databases or by directly sequencing the genomic DNA.

In the method of identifying a binding site for a transcription factor of the invention, the CNS may be tested by a gel shift-binding assay.

The present invention is further directed to a method of identifying a conserved noncoding sequence in the genomes of at least two plants by a) identifying coding regions in the genomes; b) masking out the coding regions to form masked genomic sequences and unmasked genomic sequences in said genomes; and c) comparing the unmasked genomic sequences to identify two or more conserved noncoding sequences wherein the conserved noncoding sequences contain at least 7 identical nucleotides, to thereby identify the conserved non-coding sequences.

In the method of identifying a conserved noncoding sequence of the invention, the unmasked genomic sequences may be compared using a word size of seven, a gap penalty existence from 2-5 and a gap penalty extension from 1-2.

The method of identifying a conserved noncoding sequence may further include identifying the position of the conserved noncoding sequences relative to the masked genomic sequence.
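The masking and comparison steps can be sketched as follows. This is a simplified illustration only: it masks coding intervals with a placeholder character and reports exact shared words of seven nucleotides in the unmasked regions, whereas the comparison described above uses a gapped alignment with existence and extension penalties. The sequences, coding intervals and function names are hypothetical.

```python
def mask_coding(seq: str, coding_intervals, mask_char: str = "N") -> str:
    """Replace coding intervals (half-open, 0-based) with a mask character."""
    chars = list(seq)
    for start, end in coding_intervals:
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


def shared_words(masked_a: str, masked_b: str, word_size: int = 7, mask_char: str = "N") -> list:
    """Return (pos_a, pos_b, word) for every exact word shared by two masked sequences."""
    index = {}
    for i in range(len(masked_a) - word_size + 1):
        word = masked_a[i:i + word_size]
        if mask_char not in word:
            index.setdefault(word, []).append(i)
    hits = []
    for j in range(len(masked_b) - word_size + 1):
        word = masked_b[j:j + word_size]
        if mask_char in word:
            continue
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits


# Hypothetical sequences: the first 9 nucleotides of each are treated as coding.
genome_a = "ATGGCCAAA" + "TTATTACATG" + "CGCG"
genome_b = "ATGGCCAAG" + "GGA" + "TTATTACATG" + "TATA"

masked_a = mask_coding(genome_a, [(0, 9)])
masked_b = mask_coding(genome_b, [(0, 9)])
hits = shared_words(masked_a, masked_b)
```

Because positions are kept in the coordinates of the original sequences, each hit can be located relative to the masked (coding) region, for example as the distance from the end of the masked interval.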

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows lg1 CNSs conserved between maize and rice by presenting a graphic display of CNS locations and sizes found when comparing 7142 bp of maize sequence at lg1 with 7100 bp of orthologous rice sequence. In FIG. 1A, the Blast parameters were adjusted to favor shorter, more stringent CNSs. In FIG. 1B, the Blast parameters were adjusted to favor longer, less exact matches. FIG. 1C shows a 5 grass alignment of lg1 intron 1.

FIG. 2 shows that lg1 CNS3 is conserved in grasses. CNS3 was used as a PCR primer site, paired with a lg1 exon 1 site, to amplify the promoter, 5′ UTR and first exon regions of lg1 from various grasses. FIG. 2A shows an ethidium bromide stained gel of the PCR products. The gel was transferred onto a nylon membrane and hybridized with a combined lg1 exon1 region from maize and rice. FIG. 2B shows the autoradiogram after hybridization. In FIGS. 2A and 2B Lane 1=rice (Oryza sativa), 2=weedy foxtail (Setaria viridis), 3=sorghum (Sorghum bicolor), 4=maize (Zea mays), 5=Muhlenbergia porteri, 6=Arundo donax, 7=a bamboo (Bambusa multiplex).

FIG. 3A shows piled-up sequences from various plant genes containing the smaller CNS sequence: TTATTACATGPyG (SEQ ID NO: 1). FIG. 3B is a line up of CNSlg3-i2 from representative grasses. FIG. 3C shows phylogenetic relationships between different plants. The asterisk (*) in FIG. 3C indicates the species in which useful conservation of CNS sequences has been demonstrated.

FIG. 4 shows amplification reaction products from non-domesticated (wild) species using CNS. FIG. 4A shows PCR products using primers CNS3-2 and Lg1-77. Lanes from left to right: 6 Setaria fabreii (2425, 641, 1816, 2418, 2423 and 2448) and 5 Setaria glauca (Z3, 003, 091, 098 and 2406). FIG. 4B shows PCR products digested with Msp I.

DETAILED DESCRIPTION OF THE INVENTION

In order to more fully understand the invention, the following definitions are provided.

An amplification pattern “database” denotes a set of stored data which represent a collection of amplification patterns generated by amplifying nucleic acids derived from a plurality of test samples using the multi-site test device of the present invention.

The term “conserved noncoding sequence” (“CNS”) refers to evolutionarily conserved stretches of DNA that do not encode protein. Conserved noncoding sequences are generally located near exons and are more than 7 nucleotides in length.

The term “pan grass mapping marker” refers to any marker that is useful in most, many or all members of the Gramineae (grass family), a family of flowering plants that includes the majority of food crops.

The term “PCR” refers to the Polymerase Chain Reaction.

The phrase “mixed defined plant population” refers to a plant population containing many different families and lines of plants. Typically, the defined plant population exhibits a quantitative variability for a phenotype that is of interest.

The phrase “multiple plant families” refers to different families of related plants within a population.

The term “exon” refers to specific regions of DNA, which code for protein.

A “polymorphism” is a change or difference between two related nucleic acids. A “nucleotide polymorphism” refers to a nucleotide, which is different in one sequence when compared to a related sequence when the two nucleic acids are aligned for maximal correspondence. A “genetic nucleotide polymorphism” refers to a nucleotide which is different in one sequence when compared to a related sequence when the two nucleic acids are aligned for maximal correspondence, where the two nucleic acids are genetically related, i.e., homologous, e.g., where the nucleic acids are isolated from different strains of a plant, or from different alleles of a single strain, or the like.

A “QTL” or “quantitative trait locus” includes genes that control, to some degree, numerically representable phenotypic traits (disease resistance, crop yield, resistance to environmental extremes, etc.), that are distributed within a family of individuals as well as within a population of families of individuals. To measure QTLs, two inbred lines are typically crossed and multiple marker loci are genotyped, with one to several quantitative phenotypic traits among the progeny of the cross being evaluated. QTL are then identified and ultimately selected for based on significant statistical associations between the genotypic values determined by genetic marker technology and the phenotypic variability among the segregating progeny. Typical QTL include yield, grain moisture, grain oil, root lodging, stalk lodging, plant height, ear height, disease resistance, insect resistance, resistance to soybean cyst nematode, resistance to brown stem rot, resistance to phytophthora rot, and many others.

As used herein, the term “gene” is used broadly to refer to any segment of nucleic acid associated with a biological function. Thus, genes include coding sequences and/or the regulatory sequences required for their expression. For example, gene refers to a nucleic acid fragment that expresses mRNA or functional RNA, or encodes a specific protein, and which includes regulatory sequences. Genes also include nonexpressed DNA segments that, for example, form recognition sequences for other proteins. Genes can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters. The term “native” or “wild type” gene refers to a gene that is present in the genome of an untransformed cell, i.e., a cell not having a known mutation.

A “marker gene” encodes a selectable or screenable trait.

The term “chimeric gene” refers to any gene that contains 1) DNA sequences, including regulatory and coding sequences, that are not found together in nature, or 2) sequences encoding parts of proteins not naturally adjoined, or 3) parts of promoters that are not naturally adjoined. Accordingly, a chimeric gene may comprise regulatory sequences and coding sequences that are derived from different sources, or comprise regulatory sequences and coding sequences derived from the same source, but arranged in a manner different from that found in nature.

A “transgene” refers to a gene that has been introduced into the genome by transformation and is stably maintained. Transgenes may include, for example, genes that are either heterologous or homologous to the genes of a particular plant to be transformed. Additionally, transgenes may comprise native genes inserted into a non-native organism, or chimeric genes. The term “endogenous gene” refers to a native gene in its natural location in the genome of an organism. A “foreign” gene refers to a gene not normally found in the host organism but that is introduced by gene transfer.

An amplification “primer” corresponding to a conserved noncoding sequence of the invention refers to primers for use in amplification reactions. Primers may be about 30 or fewer nucleotides in length (e.g., 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24, or any number between 9 and 30). Generally specific primers are upwards of 14 nucleotides in length. For optimum specificity and cost effectiveness, primers of 16 to 24 nucleotides in length may be preferred. Those skilled in the art are well versed in the design of primers for use in amplification processes such as PCR. For use as amplification primers, conserved noncoding sequences of 15, 16, 17 and 18 mers and longer oligonucleotides generally find use in the invention. For computational analysis, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 mers and longer oligonucleotides find use in the invention. When smaller oligonucleotides (e.g., 7 mers) are used in computational analysis, clusters of oligomers are generally required in addition to the use of algorithms to interpret the data.
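As a simple illustration of the preferred length range, the following Python sketch enumerates every sub-sequence of 16 to 24 nucleotides within a CNS as a candidate primer. The CNS shown is a hypothetical 20-mer, and the sketch deliberately ignores the other considerations (melting temperature, self-complementarity and the like) that those skilled in the art apply when designing primers.

```python
def candidate_primers(cns: str, min_len: int = 16, max_len: int = 24) -> list:
    """List (start, primer) for every window of min_len..max_len nucleotides in a CNS."""
    candidates = []
    for length in range(min_len, max_len + 1):
        # range() is empty when the CNS is shorter than the requested length.
        for start in range(len(cns) - length + 1):
            candidates.append((start, cns[start:start + length]))
    return candidates


# Hypothetical 20-nucleotide CNS used only for this example.
example_cns = "TTATTACATGCGATCGTACG"
candidates = candidate_primers(example_cns)
```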

“Coding sequence” refers to a DNA or RNA sequence that codes for a specific amino acid sequence and excludes the non-coding sequences. It may constitute an “uninterrupted coding sequence”, i.e., lacking an intron, such as in a cDNA or it may include one or more introns bounded by appropriate splice junctions. An “intron” is a sequence of RNA which is contained in the primary transcript but which is removed through cleavage and re-ligation of the RNA within the cell to create the mature mRNA that can be translated into a protein.

The terms “open reading frame” and “ORF” refer to the amino acid sequence encoded between translation initiation and termination codons of a coding sequence. The terms “initiation codon” and “termination codon” refer to a unit of three adjacent nucleotides (‘codon’) in a coding sequence that specifies initiation and chain termination, respectively, of protein synthesis (mRNA translation).

A “functional RNA” refers to an antisense RNA, ribozyme, or other RNA that is not translated. Functional RNA is encoded by genes that encode RNA only.

The term “RNA transcript” refers to the product resulting from RNA polymerase catalyzed transcription of a DNA sequence. When the RNA transcript is a perfect complementary copy of the DNA sequence, it is referred to as the primary transcript or it may be a RNA sequence derived from posttranscriptional processing of the primary transcript and is referred to as the mature RNA. “Messenger RNA” (mRNA) refers to the RNA that is without introns and that can be translated into protein by the cell. “cDNA” refers to a single- or a double-stranded DNA that is complementary to and derived from mRNA.

“Regulatory sequences” and “suitable regulatory sequences” each refer to nucleotide sequences located upstream (5′ non-coding sequences), within, or downstream (3′ non-coding sequences) of a coding sequence, and which influence the transcription, RNA processing or stability, or translation of the associated coding sequence. Regulatory sequences include enhancers, promoters, translation leader sequences, introns, and polyadenylation signal sequences. They include natural and synthetic sequences as well as sequences which may be a combination of synthetic and natural sequences. The term “suitable regulatory sequences” is not limited to promoters.

“5′ non-coding sequence” refers to a nucleotide sequence located 5′ (upstream) to the coding sequence. It is present in the fully processed mRNA upstream of the initiation codon and may affect processing of the primary transcript to mRNA, mRNA stability or translation efficiency.

“3′ non-coding sequence” refers to nucleotide sequences located 3′ (downstream) to a coding sequence and include polyadenylation signal sequences and other sequences encoding regulatory signals capable of affecting mRNA processing or gene expression. The polyadenylation signal is usually characterized by affecting the addition of polyadenylic acid tracts to the 3′ end of the mRNA precursor.

“Promoter” refers to a nucleotide sequence, usually upstream (5′) to its coding sequence, which controls the expression of the coding sequence by providing the recognition for RNA polymerase and other factors required for proper transcription. “Promoter” includes a minimal promoter that is a short DNA sequence comprised of a TATA box and other sequences that serve to specify the site of transcription initiation, to which regulatory elements are added for control of expression. “Promoter” also refers to a nucleotide sequence that includes a minimal promoter plus regulatory elements that is capable of controlling the expression of a coding sequence or functional RNA. This type of promoter sequence consists of proximal and more distal upstream elements, the latter elements often referred to as enhancers. Accordingly, an “enhancer” is a DNA sequence, which can stimulate promoter activity, and may be an innate element of the promoter or a heterologous element inserted to enhance the level or tissue specificity of a promoter. It is capable of operating in both orientations (normal or flipped), and is capable of functioning even when moved either upstream or downstream from the promoter. Both enhancers and other upstream promoter elements bind sequence-specific DNA-binding proteins that mediate their effects. Promoters may be derived in their entirety from a native gene, or be composed of different elements derived from different promoters found in nature, or even be comprised of synthetic DNA segments. A promoter may also contain DNA sequences that are involved in the binding of protein factors known as “transcription factors” which control the effectiveness of transcription initiation in response to physiological or developmental conditions.

The “initiation site” is the position surrounding the first nucleotide that is part of the transcribed sequence, which is also defined as position +1. With respect to this site all other sequences of the gene and its controlling regions are numbered. Downstream sequences (i.e., further protein encoding sequences in the 3′ direction) are denominated positive, while upstream sequences (mostly of the controlling regions in the 5′ direction) are denominated negative.

Promoter elements, particularly a TATA element, that are inactive or that have greatly reduced promoter activity in the absence of upstream activation are referred to as “minimal or core promoters.” In the presence of a suitable transcription factor, the minimal promoter functions to permit transcription. A “minimal or core promoter” thus consists only of all basal elements needed for transcription initiation, e.g., a TATA box and/or an initiator.

The terms “heterologous DNA sequence,” “exogenous DNA segment” or “heterologous nucleic acid,” as used herein, each refer to a sequence that originates from a source foreign to the particular host cell or, e.g. the CNSs of the present invention, if from the same source, is modified from its original form. Thus, a heterologous gene in a host cell includes a gene that is endogenous to the particular host cell but has been modified through, for example, the use of DNA shuffling. The terms also include non-naturally occurring multiple copies of a naturally occurring DNA sequence. Thus, the terms refer to a DNA segment that is foreign or heterologous to the cell, or homologous to the cell but in a position within the host cell nucleic acid in which the element is not ordinarily found. Exogenous DNA segments are expressed to yield exogenous polypeptides. A “homologous” DNA sequence is a DNA sequence that is naturally associated with a host cell into which it is introduced.

“Homologous to” in the context of nucleotide sequence identity refers to the similarity between the nucleotide sequence of two nucleic acid molecules or between the amino acid sequences of two protein molecules. Estimates of such homology are provided by either DNA-DNA or DNA-RNA hybridization under conditions of stringency as is well understood by those skilled in the art (as described in Hames and Higgins (eds.), Nucleic Acid Hybridization, IRL Press, Oxford, U.K.), or by the comparison of sequence similarity between two nucleic acids or proteins.

The term “altered plant trait” means any phenotypic or genotypic change in a transgenic plant relative to the wild-type or non-transgenic plant host.

“Genome” refers to the complete genetic material of an organism.

The term “nucleic acid” refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form, composed of monomers (nucleotides) containing a sugar, phosphate and a base, which is either a purine or pyrimidine. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides which have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer, et al., 1991; Ohtsuka, et al., 1985; Rossolini, et al. 1994). A “nucleic acid fragment” is a fraction of a given nucleic acid molecule. In higher plants, deoxyribonucleic acid (DNA) is the genetic material while ribonucleic acid (RNA) is involved in the transfer of information contained within DNA into proteins. The term “nucleotide sequence” refers to a polymer of DNA or RNA which can be single- or double-stranded, optionally containing synthetic, non-natural or altered nucleotide bases capable of incorporation into DNA or RNA polymers. The terms “nucleic acid” or “nucleic acid sequence” may also be used interchangeably with gene, cDNA, DNA and RNA encoded by a gene.

The nucleotide sequences of the invention include both the naturally occurring CNS sequences as well as mutant (variant) forms. Such variants will continue to possess the desired activity, i.e., hybridizing to a region of interest of the non-variant nucleotide sequence.

“Cloning vectors” typically contain one or a small number of restriction endonuclease recognition sites at which foreign DNA sequences can be inserted in a determinable fashion without loss of essential biological function of the vector, as well as a marker gene that is suitable for use in the identification and selection of cells transformed with the cloning vector. Marker genes typically include genes that provide tetracycline resistance, hygromycin resistance or ampicillin resistance.

A “transgenic plant” is a plant having one or more plant cells that contain an expression vector.

“Plant tissue” includes differentiated and undifferentiated tissues or plants, including but not limited to roots, stems, shoots, leaves, pollen, seeds, tumor tissue and various forms of cells and culture such as single cells, protoplast, embryos, and callus tissue. The plant tissue may be in plants or in organ, tissue or cell culture.

The following terms are used to describe the sequence relationships between two or more nucleic acids or polynucleotides: (a) “reference sequence”, (b) “comparison window”, (c) “sequence identity”, (d) “percentage of sequence identity”, and (e) “substantial identity”.

(a) As used herein, “reference sequence” is a defined sequence used as a basis for sequence comparison. A reference sequence may be a subset or the entirety of a specified sequence; for example, a segment of a full length cDNA or gene sequence, or the complete cDNA or gene sequence.

(b) As used herein, “comparison window” makes reference to a contiguous and specified segment of a polynucleotide sequence, wherein the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Generally, the comparison window is at least 20 contiguous nucleotides in length, and optionally can be 30, 40, 50, 100, or longer. Those of skill in the art understand that to avoid a high similarity to a reference sequence due to inclusion of gaps in the polynucleotide sequence a gap penalty is typically introduced and is subtracted from the number of matches.
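The effect of a gap penalty can be illustrated with a short Python sketch. It takes two already-aligned sequences of equal length, in which "-" marks a gap, counts matching positions and subtracts a penalty for each gapped position. The penalty value and the example alignment are assumptions made for illustration; actual programs typically use separate gap existence and extension penalties.

```python
def gap_penalized_score(aligned_a: str, aligned_b: str, gap_penalty: float = 1.0) -> float:
    """Number of matching columns minus a penalty for every column containing a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
    return matches - gap_penalty * gaps


# Hypothetical alignment: 7 matching columns and 2 gapped columns.
score = gap_penalized_score("ACGT-ACGT", "ACGTTAC-T")
unpenalized = gap_penalized_score("ACGT-ACGT", "ACGTTAC-T", gap_penalty=0.0)
```

Without the penalty, inserting gaps freely could raise the match count of unrelated sequences; the subtraction offsets that effect.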

Methods of alignment of sequences for comparison are well known in the art. Thus, the determination of percent identity between any two sequences can be accomplished using a mathematical algorithm. Preferred, non-limiting examples of such mathematical algorithms are the algorithm of Myers and Miller, 1988; the local homology algorithm of Smith et al. 1981; the homology alignment algorithm of Needleman and Wunsch 1970; the search-for-similarity-method of Pearson and Lipman 1988; the algorithm of Karlin and Altschul, 1990, modified as in Karlin and Altschul, 1993.

Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include, but are not limited to: CLUSTAL in the PC/Gene program (available from Intelligenetics, Mountain View, Calif.); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8 (available from Genetics Computer Group (GCG), 575 Science Drive, Madison, Wis., USA). Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al. 1988; Higgins et al. 1989; Corpet, et al. 1988; Huang, et al. 1992; and Pearson, et al. 1994. The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al., 1990, are based on the algorithm of Karlin and Altschul, supra.

Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information. This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul, et al., 1990). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value, the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or the end of either sequence is reached.
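The extension step described above can be sketched in Python as follows. This is a simplified, ungapped illustration and not the BLAST implementation itself: the seed position is supplied by the caller, the value chosen for X is an assumption, the sequences are hypothetical, and the default M and N are taken from the BLASTN defaults recited in this specification.

```python
def extend_one_direction(query, subject, q, s, step, score, match, mismatch, x_drop):
    """Extend from positions (q, s) in one direction (step is +1 or -1).

    Returns (best_score, best_length), where best_length is the number of
    residues added at the point of maximum cumulative score.
    """
    best_score = score
    best_len = 0
    length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        # Halt when the score is zero or below, or has dropped by X from its maximum.
        if score <= 0 or best_score - score >= x_drop:
            break
        q += step
        s += step
    return best_score, best_len


def extend_seed(query, subject, q_start, s_start, word_size, match=5, mismatch=-4, x_drop=20):
    """Extend an exact word hit in both directions; return (score, query segment)."""
    seed_score = word_size * match
    score, right_len = extend_one_direction(
        query, subject, q_start + word_size, s_start + word_size, +1,
        seed_score, match, mismatch, x_drop)
    score, left_len = extend_one_direction(
        query, subject, q_start - 1, s_start - 1, -1,
        score, match, mismatch, x_drop)
    segment = query[q_start - left_len:q_start + word_size + right_len]
    return score, segment


# Hypothetical sequences sharing the 7-nucleotide word "TTATTAC" at position 2 of each.
query = "GGTTATTACATGCC"
subject = "AATTATTACATGTT"
hsp_score, hsp_segment = extend_seed(query, subject, 2, 2, word_size=7)
```

In this example the seed extends three residues to the right before mismatches begin to lower the score, and does not extend to the left.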

In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul (1993)). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a test nucleic acid sequence is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid sequence to the reference nucleic acid sequence is less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001.

To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized as described in Altschul, et al. 1997. Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. See Altschul et al., supra. When utilizing BLAST, Gapped BLAST, PSI-BLAST, the default parameters of the respective programs (e.g. BLASTN for nucleotide sequences, BLASTX for proteins) can be used. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=−4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff, 1989). Alignment may also be performed manually by inspection.

For purposes of the present invention, comparison of nucleotide sequences for determination of percent sequence identity to the promoter sequences disclosed herein is preferably made using the BlastN program (version 1.4.7 or later) with its default parameters or any equivalent program. By “equivalent program” is intended any sequence comparison program that, for any two sequences in question, generates an alignment having identical nucleotide or amino acid residue matches and an identical percent sequence identity when compared to the corresponding alignment generated by the preferred program.

(c) As used herein, “sequence identity” or “identity” in the context of two nucleic acid or polypeptide sequences makes reference to the residues in the two sequences that are the same when aligned for maximum correspondence over a specified comparison window. When percentage of sequence identity is used in reference to proteins it is recognized that residue positions which are not identical often differ by conservative amino acid substitutions, where amino acid residues are substituted for other amino acid residues with similar chemical properties (e.g., charge or hydrophobicity) and therefore do not change the functional properties of the molecule. When sequences differ in conservative substitutions, the percent sequence identity may be adjusted upwards to correct for the conservative nature of the substitution. Sequences that differ by such conservative substitutions are said to have “sequence similarity” or “similarity.” Means for making this adjustment are well known to those of skill in the art. Typically this involves scoring a conservative substitution as a partial rather than a full mismatch, thereby increasing the percentage sequence identity. Thus, for example, where an identical amino acid is given a score of 1 and a non-conservative substitution is given a score of zero, a conservative substitution is given a score between zero and 1. The scoring of conservative substitutions is calculated, e.g., as implemented in the program PC/GENE (Intelligenetics, Mountain View, Calif.).

(d) As used herein, “percentage of sequence identity” means the value determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison, and multiplying the result by 100 to yield the percentage of sequence identity.
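
The calculation in definition (d) can be illustrated with a short Python sketch. It assumes the two sequences have already been aligned by some other program and are supplied as equal-length strings in which "-" marks a gap; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of window positions at which both sequences carry
    the same residue. Gap positions count toward the window length
    but never count as matches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != gap)
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 8 matched positions in a window of 10 positions
    print(percent_identity("ACGT-ACGTA", "ACGTTACCTA"))  # 80.0
```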

(e)(i) The term “substantial identity” of polynucleotide sequences means that a polynucleotide comprises a sequence that has at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, or 79%, preferably at least 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, or 89%, more preferably at least 90%, 91%, 92%, 93%, or 94%, and most preferably at least 95%, 96%, 97%, 98%, or 99% sequence identity, compared to a reference sequence using one of the alignment programs described using standard parameters. One of skill in the art will recognize that these values can be appropriately adjusted to determine corresponding identity of proteins encoded by two nucleotide sequences by taking into account codon degeneracy, amino acid similarity, reading frame positioning, and the like. Substantial identity of amino acid sequences for these purposes normally means sequence identity of at least 70%, more preferably at least 80%, 90%, and most preferably at least 95%.

Another indication that nucleotide sequences are substantially identical is if two molecules hybridize to each other under stringent conditions (see below). Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (T_(m)) for the specific sequence at a defined ionic strength and pH. However, stringent conditions encompass temperatures in the range of about 1° C. to about 20° C., depending upon the desired degree of stringency as otherwise qualified herein. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides they encode are substantially identical. This may occur, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code. One indication that two nucleic acid sequences are substantially identical is when the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid.

For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.

As noted above, another indication that two nucleic acid sequences are substantially identical is that the two molecules hybridize to each other under stringent conditions. The phrase “hybridizing specifically to” refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular) DNA or RNA. “Bind(s) substantially” refers to complementary hybridization between a probe nucleic acid and a target nucleic acid and embraces minor mismatches that can be accommodated by reducing the stringency of the hybridization media to achieve the desired detection of the target nucleic acid sequence.

“Recombinant DNA molecule” is a combination of DNA sequences that are joined together using recombinant DNA technology and procedures used to join together DNA sequences as described, for example, in Sambrook et al., 1989.

The word “plant” refers to any plant, particularly to seed plants, and “plant cell” is a structural and physiological unit of the plant, which comprises a cell wall but may also refer to a protoplast. The plant cell may be in the form of an isolated single cell or a cultured cell, or may be part of a more highly organized unit such as, for example, a plant tissue or a plant organ.

Taking into account these definitions, the present invention is directed to the identification of CNSs and their use in molecular and genetic analyses. While the specific example below describes the use of various techniques in the analysis of plants using CNSs, the techniques of this invention find use in the analysis of all organisms.

Introduction to CNSs

The elucidation of gene regulatory networks depends on identification of cis-acting elements and the products that bind them. Cross species genomic DNA comparisons offer a powerful method for finding cis regulatory sequences (Hardison, R. C. (2000) Trends in Genetics 16, 369-372). Control of gene expression requires cis acting regulatory DNA sequences. Historically these sequences have been difficult to identify. Conserved noncoding sequences (CNSs) have recently been identified in mammalian genes through cross species genomic DNA comparisons.

Regulatory elements have historically been difficult to identify because, unlike exons, they are not constrained by the codon syntax necessary for use in translation. Traditionally, cis-acting regulatory elements have been identified through biochemical and molecular genetic experimental approaches. Regulatory sequences are generally small, e.g., six to eight nucleotides long (Tjian, R. (1995) Scientific American 272, 54-61; Ficket, J. W. & Hatzigeorgiou, A. G. (1997) Genome Research 7, 861-878; Bucher, P. (1999) Current Opinion in Structural Biology 9, 400-407; Sumiyama, K. et al. (2001) Genomics 71, 260-262). A recent advance in identifying mammalian cis regulatory elements was made by leveraging the power of genomic DNA comparisons based on the premise that these functionally important regulatory regions will be conserved between species (Hardison, R. C. (2000) Trends in Genetics 16, 369-372). Comparisons between a syntenous human and mouse genomic DNA segment identified CNSs that were experimentally shown to regulate the expression of several nearby genes (Loots, G. G. et al. (2000) Science (Washington D.C.) 288, 136-140). CNSs correspond to regions of endonuclease sensitivity in man/mouse comparisons (Gottgens, B. et al. (2001) Genome Research 11, 87-97) as well as other previously defined regulatory elements (Sumiyama et al. (2001); Flint, J. et al. (2001) Human Molecular Genetics 10, 371-382; Stojanovic, N. et al. (1999) Nucleic Acids Research 27, 3899-3910). A study of 502 human genes showed that regulatory sequences are enriched in CNSs compared to nonconserved noncoding regions (Levy, S. et al. (2001) Bioinformatics (Oxford) 17, 871-877). Most identified CNSs have no experimentally defined function (Dubchak, I. et al. (2000) Genome Research 10, 1304-1306). There are now several web based tools that can be used to identify CNSs (Mayor, C. et al. (2000) Bioinformatics (Oxford) 16, 1046-1047; Schwartz, S. et al. (2000) Genome Research 10, 577-586) and functional RNAs (Carter, R. J. et al. (2001) Nucleic Acids Research 29, 3928-3938).

Rice and maize are two domesticated species in the Poaceae, the grass family of flowering plants, which includes wheat, barley, sugarcane, and other cultivated staples. This family has about 10,000 species that have been grouped together based on shared morphological traits and nucleotide sequences. [Phylogenies built upon these data suggest that grasses are monophyletic, with an ancestral genome existing about 50 million years ago (MYA) for almost all of the common grasses, and perhaps 70 MYA if the basal grass genera are included (Kellogg, E. A. (2001) Plant Physiology (Rockville) 125, 1198-1205)]. Many grass genes are similar enough at a nucleotide level to permit comparative grass gene mapping via cross species Southern blot hybridization (Gale, M. D. & Devos, K. M. (1998) Proceedings of the National Academy of Sciences of the United States of America 95, 1971-1974). It has been hypothesized that all grasses have approximately the same genes in roughly the same order (Devos, K. M. & Gale, M. D. (2000) Plant Cell 12, 637-646). This order may need to be deduced by a process called “consolidation” (Freeling, M. (2001) Plant Physiology (Rockville) 125, 1191-1197). Without being bound by theory, if different grasses have the same genes in the same order, then the morphological and physiological diversity of grasses is essentially allelic diversity distributed phylogenetically. Thus, it is of particular importance and interest to identify and to phylogenetically analyze those potential regulatory sequences responsible for the allelic diversity. There are now enough orthologous genomic DNA sequences in grasses, notably between maize and rice, to begin to address the question of grass gene regulatory diversity.

When genomic DNA sequences carrying orthologous genes in related organisms are compared, the exon sequences and the exon-intron structure line up near-perfectly. The regions of a gene that do not encode proteins, including the stretches of DNA that encode untranslated RNA, are called “noncoding sequence.” When these are compared, which can be done using a variety of algorithms, the general result is that noncoding regions show little or no obvious sequence homology. The exceptions are called CNSs, or conserved noncoding sequences. A recent review, focusing on man and mouse, has been written by Ross C. Hardison (Hardison, R. C., 2000. Trends in Genetics 16, 369-372: Conserved noncoding sequences are reliable guides to regulatory elements). The pioneering paper of G. G. Loots and coworkers (2000, Science 288, 136-140: Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons) showed that a CNS identified by man-mouse comparison actually bound a regulatory protein. Similarly, Kammandel and coworkers (Developmental Biology 205: 79-97: Distinct cis-essential modules direct the time-space pattern of the Pax6 gene activity) showed function of a CNS highly conserved between a fish, mouse and man. Gottgens and coworkers (2001, Genome Research 11, 87-97: Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences) find that CNSs tend to be available to nucleases, as expected of regulatorily active chromatin. Similarly, when dog is added to human-mouse comparisons, there remain a large number of CNS conservations, conservations that have held during the very approximately 100 million years of divergent evolution from a Cretaceous (144-65 Mya) ancestor, but there have been questions as to how many of these CNSs are simply “leftover” from the ancestor and do not reflect conservation for function. Dogs, mice and humans are species in different orders of mammals. All grasses are in the same family, and are thought to be related to a common grass ancestor 50-70 Mya, with rice and maize diverging approximately 50 Mya (E. A. Kellogg, 2001. Evolutionary history of the grasses. Plant Physiology 125, 1198-1205; M. Freeling, 2001. Grasses as a single genetic system: Reassessment 2001. Plant Physiology 125: 1191-97).

Identification of CNSs in Plants and Other Organisms

CNSs are identified in plants and other organisms by techniques familiar to workers of ordinary skill in the art. As detailed in the Example, in one non-limiting embodiment, CNSs are identified using a set of custom Perl scripts that perform the following operations. Input: two genomic DNA sequences of the loci to be compared and one protein sequence corresponding to one of the genomic sequences are used as input for the program. Masking: the coding regions of the DNA sequences are identified using NCBI's BLASTX. The sequence in the coding regions is replaced with a string of N's, which will not be compared in the next step. Comparison: the DNA sequences are compared using NCBI's BLASTN, which identifies stretches of homologous sequence between the two masked sequences. A log file is generated containing these results for use in the next step. This step identifies sequence conservation of CNSs. Graphical output: the log file from the comparison step is used to generate a graphical representation showing the positions of the CNSs. The search window is set at a desired size in BLAST, for example 7 bp. The two masked genomic sequences are then compared using, for example, NCBI's bl2seq (Tatusova, T. A. & Madden, T. L. (1999) FEMS Microbiology Letters 174, 247-250), with the following parameters: for finding stringent CNSs, Word size=7, Gap Penalties: Existence=5, Extension=2; for finding long CNSs, Word size=7, Gap Penalties: Existence=2, Extension=1. The positions of CNSs relative to intron/exon structure may be visualized using Perl scripts to parse the output of both the blastx and bl2seq results.
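
The masking operation described above can be sketched as follows. This is an illustration in Python rather than the Perl scripts of the Example, and it assumes that the coding intervals have already been parsed from the BLASTX output into 0-based, half-open (start, end) coordinates; the function name `mask_coding` and that coordinate convention are assumptions made for the sketch.

```python
def mask_coding(sequence, coding_intervals):
    """Return a copy of sequence in which every coding interval is
    replaced by N's, so that only noncoding sequence is compared later.

    coding_intervals: iterable of (start, end) pairs, 0-based, half-open.
    """
    masked = list(sequence)
    for start, end in coding_intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval out of range: {(start, end)}")
        # Slice assignment keeps the sequence length unchanged, so
        # positions in the masked sequence match the original.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    print(mask_coding("AAACCCGGGTTT", [(3, 6)]))  # AAANNNGGGTTT
```

Keeping the masked sequence the same length as the input means that alignment coordinates reported in the comparison step can be mapped directly back onto the intron/exon structure.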

For the purposes of this invention, CNSs can be identified in any organism by many procedures including database scanning and comparisons, direct sequence comparisons, etc.

Uses of CNSs

The conserved noncoding sequences (CNSs) of the invention find many uses. In particular, the CNSs of the invention find use in identifying polymorphisms between two or more DNA sequences. The CNSs find further use in mapping including cross-species alignment of DNA sequences and alignment of syntenic regions outside of coding regions.

The CNSs may also be utilized in gene cloning including capturing or isolating of sets of related genes.

The CNSs may be also utilized for analyzing promoter sequences including (a) identifying functional cis-acting sequences within a promoter; (b) identifying binding sites for transcription factor complexes; (c) isolating or physically capturing DNA binding proteins/transcription factors; (d) linking binding sites to appropriate transcription factors via gene expression profiles; (e) designing promoters with novel expression characteristics and (f) titrating out transcription factors via replication of CNS-containing DNA.

Stretches of genomic information may have shared noncoding sequence motifs. These motifs may contain regulatory elements, such as cis-acting promoters or transcription factor complex binding sites. Thus, QTLs or genes in a region of genomic sequence may be under the control of common cis-acting promoter elements. The CNSs may be used to identify functional cis-acting sequences within a promoter or binding sites for transcription factor complexes. Additionally, through recognition of conserved nucleotide binding sites within noncoding regions, techniques known in the art may be used to capture the nucleotide binding proteins. The regulation of gene expression may be altered through modification of one or several CNS cis-acting regulation sites. QTL expression levels may be controlled by in situ modification of CNS sites. For example, QTL expression may be altered by modification of CNS sequences. Alternatively, promoters may be specifically designed to contain novel expression characteristics. In addition, the regulatory elements contained within a CNS may be capable of regulating clusters of QTLs. Thus, by recognizing the regulation of QTLs, regulatory networks and pathways may be constructed, such that clusters of gene sets which are co-regulated may be deciphered. Furthermore, in addition to using the CNSs for gene cloning directly, once a gene has been mapped, it can also be cloned. Positional gene cloning uses the proximity of a genetic marker to physically define a cloned chromosomal fragment that is linked to a QTL identified using the CNSs of the invention. Clones of linked nucleic acids have a variety of uses, including as genetic markers for identification of linked QTLs in subsequent marker assisted selection (MAS) protocols, and to improve desired properties in recombinant plants where expression of the cloned sequences in a transgenic plant affects an identified trait. Common linked sequences, which are desirably cloned, include open reading frames, e.g., encoding nucleic acids or proteins which provide a molecular basis for an observed QTL. If markers are proximal to the open reading frame, they may hybridize to a given DNA clone, thereby identifying a clone on which the open reading frame is located. If flanking markers are more distant, a fragment containing the open reading frame may be identified by constructing a contig of overlapping clones.

The CNSs of the invention further find use in identifying genes encoding RNA only; constructing regulatory networks and pathways and studying clustering and relationships of co-regulated gene sets.

The CNSs of the invention can also be utilized to alter the expression level and specificity of endogenous genes by in situ modification of CNSs.

The CNSs of the invention find particular use as primers. For use as amplification primers, conserved noncoding sequences of 15, 16, 17 and 18 mers and longer oligonucleotides find use in the invention. For computational analysis, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 mers and longer oligonucleotides find use in the invention. When smaller oligonucleotides (e.g., 7 mers) are used in computational analysis, clusters of oligomers are generally required in addition to the use of algorithms to interpret the data.
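
As a rough illustration of why short oligomers need to be considered in clusters, the following Python sketch lists the 7 mers shared by two sequences and groups nearby hits. It is a simplified stand-in for the algorithms mentioned above, not a description of them: it considers one strand only, requires exact matches, and the function names and the `max_gap` value are hypothetical.

```python
from collections import defaultdict


def shared_kmers(seq_a, seq_b, k=7):
    """Return sorted (pos_a, pos_b) pairs at which the same k-mer
    occurs in both sequences (exact matches, forward strand only)."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return sorted(hits)


def cluster_hits(hits, max_gap=20):
    """Group sorted hits whose position in the first sequence lies
    within max_gap of the previous hit in the same cluster."""
    clusters = []
    for hit in hits:
        if clusters and hit[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


if __name__ == "__main__":
    hits = shared_kmers("TTACGTACGGAT", "GGACGTACGGCC")
    print(hits)                # [(2, 2), (3, 3)]
    print(cluster_hits(hits))  # [[(2, 2), (3, 3)]]
```

An isolated 7 mer match is expected fairly often by chance in long sequences, whereas several matches falling in one cluster are better candidates for further analysis.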

Identification of Polymorphisms

In one embodiment, the CNSs of the invention find use in identifying polymorphisms. As discussed above, a polymorphism is a change or difference between two related nucleic acids. In one format, the invention is directed to a method of identifying a polymorphism between two or more DNA sequences by a) selecting a first and a second primer for use in an amplification reaction wherein the first or second primer is a CNS in the two or more DNA sequences; b) hybridizing the first and second primers to the two or more DNA sequences to form primer-DNA hybrids; c) amplifying the primer-DNA hybrids to form amplified DNA products; and d) separating the DNA products by size to identify a polymorphism between the two or more DNA sequences.
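
Steps a) through d) can be mimicked computationally to predict whether a primer pair would reveal a size polymorphism. The Python sketch below is a simplified in silico stand-in for the wet-lab method: it requires exact primer matches, reports only the first product on the forward strand, and uses hypothetical function names and invented sequences.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon_length(template, forward, reverse):
    """Length of the product spanning the forward primer through the
    site of the reverse primer, or None if either primer fails to match."""
    start = template.find(forward)
    if start == -1:
        return None
    site = revcomp(reverse)  # reverse primer anneals to the other strand
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return end + len(site) - start


def size_polymorphism(templates, forward, reverse):
    """Return per-sample product sizes and whether they differ."""
    sizes = {name: amplicon_length(seq, forward, reverse)
             for name, seq in templates.items()}
    distinct = {s for s in sizes.values() if s is not None}
    return sizes, len(distinct) > 1


if __name__ == "__main__":
    samples = {
        "line_a": "ACGTAC" + "AAAA" + "TCAAGG",
        "line_b": "ACGTAC" + "AAAAAAA" + "TCAAGG",  # 3 bp insertion
    }
    print(size_polymorphism(samples, "ACGTAC", "CCTTGA"))
    # ({'line_a': 16, 'line_b': 19}, True)
```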

Once isolated, such polymorphisms have many uses, including use in gene cloning and mapping.

Mapping

The CNSs of the invention are useful in mapping in all organisms including plants, animals, yeast and fungi. Such mapping includes classical Mendelian mapping and QTL mapping. (Patterson, A. H., 2002. What has QTL mapping taught us about plant domestication? New Phytologist 154: 591-608.) Traits are encoded by what are often called Quantitative Trait Loci or QTLs. The reason that QTLs are not simply called “genes” or “gene regions” is simply historical, because the difference in a quantitative trait is often encoded by multiple alleles at multiple genes, and these often involve complex patterns of interaction with modifier alleles. In fact, a QTL is most often but a single allelic difference penetrating through a great deal of noise.

Prior to the present invention, PCR-based QTL analyses were done in species where mapped primer pairs are available that amplify a fragment of genome that is rich in simple sequence repeats (SSRs) or insertions/deletions (Indels). A recent example of an SSR-based QTL map in maize is the work of Wang and coworkers, on the subject of mapping QTLs influencing elongation factor 1α content in the endosperm (Plant Physiology 125: 1271-1282). However, the maize SSR primers used not only did not adequately cover the maize genome, but were also useless for species far outside the genus Zea. Historically, when the genetic maps of different grasses have been compared one to another, coding sequence has been used as a probe, by DNA-DNA hybridization, on Southern blots on which RFLP fragments are segregating. In contrast to the use of CNSs in mapping as described for the present invention, this prior art method is not amplification-based and takes a very large amount of time. The typical map prepared utilizing RFLP fragments has markers every 20-30 map units; only a very coarse map emerges. Devos and Gale (2000. Plant Cell 12, 637-646: Genome relationships: the grass model in current research) have recently reviewed the comparative grass genome mapping data; all of these data derive from probing via hybridization with exon sequence.

In contrast to the prior art methods, mapping with CNSs is rapid, efficient and effective. Mapping with CNSs uses many of the basic techniques of mapping.

For plants, general mapping techniques are well known. In a representative first step, two different parental lines that can cross (P1 and P2) are selected. One parent should carry the trait or traits of interest, while the other plant should not carry this trait. These plants should represent races as different as is possible so that they will be likely to display polymorphism for the markers used in the mapping study. In a second step, a cross is made between P1×P2. The result is the F1 (hybrid), which will be a family or population of allelic pairs. In a subsequent step, a specific crossing protocol is initiated, of the many that can work (Patterson, 2002) and are available to one of ordinary skill in the art.

Next, the plants are selfed and F2 seed is collected. F2 plants are grown for character analysis and for DNA collection and are known as the F2 mapping population. Alternatively, individual F2's can be selected to get F3's. These F3's are selfed to get F4's, etc., until recombinant inbred lines (RILs) are produced.

For dominant characters of interest, backcrossing to the plant (P) that does not express the character generates a BC1 family. This might be repeated (BC2) and perhaps RILs generated after the back crosses. The backcrosses are a good point to apply some selection to ensure that the trait of interest is expressing well in the mapping population.

There are many variations to the above mapping strategies within the scope of the invention. Most mapping strategies will generally utilize some of the steps outlined above. The subsequent step(s) differ depending on the markers used in the mapping study and the species mapped.

In one example, mapping may be carried out completely within a model species, where molecular tools are available, like in maize, rice, cows and the like.

Mapping with CNSs offers numerous advantages over prior techniques. In prior art mapping techniques, microsatellite markers covering all chromosomes were utilized. In these methods, characters are interval mapped using one of the many QTL programs, as has been done for twinning frequency in a population of cows (Uen et al., 2000), or using AFLPs for measuring levels of polymorphism between the crop foxtail millet and its weedy relatives (Le Thierry d'Ennequin et al., 2000), SSRs for mapping cold-tolerance in maize (Enoki et al., 2002), and for mapping differences between cultivated and wild rice, two species in the same genus (Brondani et al., 2002; Brondani, C, PHN Rangel, RPV Brondani and ME Ferreira, 2002. QTL mapping and introgression of yield-related traits from Oryza glumaepatula to cultivated rice using microsatellite markers. Theor Appl Genet 104: 1192-1203). However, in contrast to mapping with CNSs, the SNPs, SSRs, AFLPs and microsatellites used in these studies do not work on species in different tribes or subfamilies. Even when the markers are gene-anchored, as with some of the SSRs, the exon region that is polymorphic in one genus is simply not polymorphic in others.

CNS-AFLP analysis

In a preferred embodiment, CNS mapping is combined with AFLP mapping to produce a robust and effective mapping technique herein referred to as “CNS-AFLP mapping”. CNS-Amplified Fragment Length Polymorphism (AFLP) mapping analysis combines the efficiency of all amplification-based techniques with the gene-anchored quality of RFLP and allozyme markers.

Amplified Fragment Length Polymorphism is a DNA fingerprinting technique available from Keygene, B. V. which detects DNA restriction fragments by means of PCR amplification. The AFLP® technology usually includes the following steps: (1) restriction of DNA with two restriction enzymes, generally a hexa-cutter and a tetra-cutter; (2) ligation of double-stranded (ds) adapters to the ends of the restriction fragments; (3) amplification of a subset of the restriction fragments using two primers complementary to the adapter and restriction site sequences, and extended at their 3′ ends by “selective” nucleotides; (4) gel electrophoresis of the amplified restriction fragments on denaturing polyacrylamide gels (“sequence gels”); (5) the visualization of the DNA fingerprints by means of autoradiography, phospho-imaging, or other methods.

Amplified Fragment Length Polymorphism technology is a random amplification technique. In contrast, CNS-AFLP mapping utilizes specific CNSs as primers and is, therefore, a specific amplification technique. CNS-AFLP primers make use of stringent PCR conditions. The CNS amplification primers are generally 15-21 nucleotides in length and anneal perfectly to their target sequences. This renders Amplified Fragment Length Polymorphism with CNS primers a very reliable and robust technique, which is unaffected by small variations in amplification parameters (e.g. thermal cyclers, template concentration, PCR cycle profile).

Restriction fragment patterns generated by means of the Amplified Fragment Length Polymorphism with CNS primers are a rich source of restriction fragment polymorphisms which can be utilized as markers. The frequency with which markers are detected depends on the level of sequence polymorphism between the tested DNA samples. The molecular basis of polymorphisms will usually be sequence polymorphisms at the nucleotide level.

The CNS-AFLP techniques can be used in a large number of applications, such as the use of CNS markers in genetic studies, such as biodiversity studies; the analysis of germplasm collections; the genotyping of individuals; genetic distance analyses; the identification of closely-linked DNA markers and the construction of genetic DNA marker maps. The CNS-AFLP techniques can also be utilized in the construction of physical maps using genomic clones such as YACs and BACs, in the precision mapping of genes, and the subsequent isolation of these genes.

One such CNS-AFLP marker is described in the Example: CNS3-Exon1 lg1-specific marker. This pair of PCR primers is known to work on many grasses from almost all subfamilies of grass. We have compared different accessions (cultivars or landraces) of two different species of Setaria: Setaria glauca and Setaria fabarii. When the PCR products were run out on gels, they all looked to be about the same size (FIG. 4A). However, when the products were cut with the 4-cutter restriction enzyme MspI, polymorphism at the lg1 gene was detected for both species (FIG. 4B, Setaria CNS-AFLP polymorphisms). This is one CNS-AFLP marker anchored to the known grass gene lg1. Identifying one of these markers every 10 map units (130 or so) provides a pan-grass toolbox of CNS-AFLP mapping markers for low resolution mapping.
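
The observation that products of the same size can still differ after digestion with a 4-cutter can be illustrated with a small in silico digest. The Python sketch below assumes the MspI recognition site CCGG, cut after the first C, and uses two invented 12 bp sequences that differ by a single base; the function name `digest` is hypothetical and only one strand of a linear fragment is modeled.

```python
def digest(sequence, site="CCGG", cut_offset=1):
    """Return the fragment lengths produced by cutting a linear
    sequence at every occurrence of site (default: MspI, C^CGG)."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    product_1 = "AAAACCGGTTTT"  # carries an MspI site
    product_2 = "AAAACCAGTTTT"  # same length, site lost by one substitution
    print(len(product_1) == len(product_2))  # True
    print(digest(product_1))                 # [5, 7]
    print(digest(product_2))                 # [12]
```

Here the undigested products would co-migrate on a gel, while the digested products give different banding patterns, which is the kind of polymorphism scored in a CNS-AFLP analysis.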

CNS-AFLP can be utilized in conjunction with traditional mapping techniques, for example, by sorting a mapping population into those members that express the character and those that do not. In the method, DNA is prepared from each individual and a small fraction of this DNA is used as template for an amplification reaction with primer pairs that differentiate, via fragment lengths, between P1 and P2. If the character segregates and assorts independently of a marker polymorphism, there is no linkage. However, if one of the marker points happens to be right next to the locus determining the character, or to a QTL specifying some component of the character, or to some combination of alleles that together interact to specify a character, then the character can be mapped, however crudely, to a chromosome. Since CNS-AFLPs may be constructed for every gene of interest, for example grass genes, a crude map position can soon be reduced to more specific map positions. If a character or QTL can be mapped to about 1 map unit, that is approximately 147 genes (see Freeling, 2000 for calculation). Among these genes will be the gene or genes responsible for the character or component of character being mapped.
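
One conventional way to ask whether a character assorts independently of a marker is a chi-square test on a 2×2 table of counts (character expressed or not, by P1-type or P2-type fragment). The Python sketch below is offered only as an illustration of that general statistical idea, not as the procedure of the invention; the counts are invented, the function name is hypothetical, and 3.841 is the standard 5% critical value for one degree of freedom.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts given as
    ((a, b), (c, d)): rows = character expressed / not expressed,
    columns = P1-type / P2-type marker fragment."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column needs at least one count")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    CRITICAL_5_PERCENT_1_DF = 3.841
    linked = ((40, 10), (10, 40))    # character tracks the marker
    unlinked = ((25, 25), (25, 25))  # independent assortment
    for name, table in (("linked", linked), ("unlinked", unlinked)):
        stat = chi_square_2x2(table)
        print(name, stat, stat > CRITICAL_5_PERCENT_1_DF)
```

A statistic above the critical value suggests that the character and the marker do not assort independently, which is the signal that the marker lies near a locus contributing to the character.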

In one specific embodiment, CNS-AFLP mapping can be utilized in a cross between one rare, unknown grass and another rare, unknown grass. The CNS-AFLP primer pairs will work just as well on any grass genome, and map the character efficiently onto the rice genome. General mapping techniques and strategies are described in the following references, which are hereby incorporated by reference in their entirety.

Crasta, OR, WW Xu, DT Rosenow, J. Mullet and HT Nguyen, 1999. Mapping of post-flowering drought resistance traits in grain sorghum: association between QTLs influencing premature senescence and maturity. Molec Gen Genetics 262: 579-588.
Devos, KM and MD Gale, 2000. Genome relationships: the grass model in current research. Plant Cell 12: 637-646.
Devos, KM, ZM Wang, J. Beales, T. Sasaki and MD Gale, 1998. Comparative genetic maps of foxtail millet (Setaria italica) and rice. Theor Appl Genet 96: 63-68.
Devos, KM, TS Pittaway, A Reynolds and MD Gale, 2000. Comparative mapping reveals a complex relationship between the pearl millet genome and those of foxtail millet and rice. Theor Appl Genet 100: 190-198.
Foolad, MR and FQ Chen, 1999. RFLP mapping of QTLs conferring salt-tolerance during the vegetative stage of tomato. Theor Appl Genet 99: 235-243.
Freeling, M, 2001. Grasses as a single genetic system: Reassessment 2001. Plant Physiology 125: 1191-1197.
Le Thierry d'Ennequin, O. Panaud and B. Toupance, 2000. Assessment of genetic relationships between Setaria italica and its wild relative S. viridis using AFLP markers. Theor Appl Genet 100: 1061-1066.
Uen, S, A. Karlsen, G. Klemetsdal, DI Vage, I. Olsaker, H. Klungland, M. Aasland, B. Heringstad, J. Ruane and L. Gomez-Raya, 2000. A primary screen of the bovine genome for quantitative trait loci affecting twinning rate. Mammalian Genome 11: 877-882.
Poncet, V, F Lamy, KM Devos, MD Gale, A. Sarr and T. Robert, 2000. Genetic control of domestication traits in pearl millet. Theor Appl Genet 100: 147-159.
Wang, R. L., Devos, K., Liu, C., and M. Gale, 1998. Construction of RFLP-based maps of foxtail millet, Setaria italica (L.). Theor Appl Genet 96: 31-36.
Yadav, RS, CT Hash, FR Bidinger, GP Cavan and CJ Howarth, 2002. Quantitative trait loci associated with traits determining grain and stover yield in pearl millet under terminal drought-stress conditions. Theor Appl Genet 104: 67-83.

One way to map a character in an as yet unexplored plant is to link that character to some PCR fragment size. After linkage is ascertained on a sequencing gel, one may go back and sequence the responsible PCR fragment to obtain the exon sequence that would anchor the marker to a real gene. By using combinations of promoter CNS primers and either exon primers or intron CNS primers, just such “exon-anchored” fragments are generated. If the polymorphisms cannot be seen in whole PCR fragments, then the fragments may be restricted with multiple 4-base cutters (restriction enzymes) until small fragments are generated, and these are analyzed on a sequencing gel. This method allows detection of small promoter/intron indels and SSRs, and the gene from which the fragment derived is reconstructed. Thus, a tool box of amplification primers may be prepared that will generate a diversity of amplification fragment sizes, and if any of these lengths mark a trait of interest, that fragment can be “reconstructed” and the responsible amplification product sequenced to identify the gene, and thus the location in the ancestral grass gene order.
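The restriction step above can be illustrated with an in silico digest. The sketch below is hypothetical: the function names and example sequences are invented for illustration, and only four well-known 4-base cutters are listed (MseI T^TAA, TaqI T^CGA, AluI AG^CT, HaeIII GG^CC). It cuts two amplified fragments with the same enzymes and reports the fragment lengths that differ, which is how a small indel hidden in a large PCR product becomes visible.

```python
import re
from collections import Counter

# Recognition site and top-strand cut offset for a few 4-base cutters.
FOUR_BASE_CUTTERS = {
    "MseI": ("TTAA", 1),
    "TaqI": ("TCGA", 1),
    "AluI": ("AGCT", 2),
    "HaeIII": ("GGCC", 2),
}


def digest(seq, enzymes):
    """Return fragment lengths, in order, after cutting seq with all enzymes."""
    seq = seq.upper()
    cuts = set()
    for name in enzymes:
        site, offset = FOUR_BASE_CUTTERS[name]
        # A lookahead is used so that overlapping sites are all found.
        for match in re.finditer("(?=%s)" % site, seq):
            cuts.add(match.start() + offset)
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def polymorphic_lengths(fragments_a, fragments_b):
    """Fragment lengths seen in one digest but not in the other."""
    count_a, count_b = Counter(fragments_a), Counter(fragments_b)
    return (sorted((count_a - count_b).elements()),
            sorted((count_b - count_a).elements()))
```

In this sketch a 3 bp insertion changes a single internal fragment length, while the flanking fragments stay identical between the two digests.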

CNS-AFLP mapping offers several advantages over prior art techniques, as indicated below. Marker selection depends upon whether or not the marker is gene-anchored, how difficult or expensive the marker is to use, and whether or not the marker will work outside of the species or genus for which it was designed to be polymorphic (that is, in other tribes). As the table below shows, of all the molecular mapping marker “tool kits,” CNS-AFLP is gene-anchored and useful at the family level (as RFLPs are), but is also PCR-based (inexpensive and quick).

Method               Gene-anchored?   Difficult?/Expensive?   Other tribes?
RFLP                 yes              yes                     yes
SSR                  sometimes        no                      no
SNP                  yes              no                      no
Microsatellite       no               no                      no
AFLP and relatives   no               no                      no
CNS-AFLP             yes              no                      yes

A primary motivation for development of molecular markers in animal and crop species is the potential for increased efficiency in breeding through marker assisted selection (MAS). After appropriate CNSs have been identified as described above, the corresponding genetic marker alleles can be used to identify plants that contain the desired genotype at multiple loci and would be expected to transfer the desired genotype, along with the desired phenotype, to their progeny.
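As a simple illustration of selecting on multiple loci, the following sketch filters a set of scored plants for those homozygous for a desired marker allele at every locus of interest. The data layout, the locus and plant names, and the homozygosity requirement are all assumptions made for the example; an actual selection scheme would depend on the breeding design.

```python
def select_for_breeding(genotypes, desired_alleles):
    """Return names of plants homozygous for the desired allele at every locus.

    genotypes: {plant_name: {locus_name: (allele, allele)}}
    desired_alleles: {locus_name: allele}
    Plants with a missing score at any requested locus are not selected.
    """
    selected = []
    for plant, loci in genotypes.items():
        if all(loci.get(locus) == (allele, allele)
               for locus, allele in desired_alleles.items()):
            selected.append(plant)
    return sorted(selected)
```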

The presence and/or absence of a particular genetic marker allele in the genome of a plant or animal exhibiting a preferred phenotypic trait is determined by using the CNS techniques described above. If the nucleic acids from the plant hybridize to a probe specific for a desired genetic marker, the plant can be selfed to create a true breeding line with the same genome, or it can be crossed with a plant with the same QTL or with other desired characteristics to create a sexually crossed F₁ generation.

In another embodiment, CNS primers may be used in combination with other genetic markers, including RFLPs, AFLPs, isozymes, specific alleles and variable sequences. Mapping may be performed using high-throughput screening. In one embodiment, high-throughput screening involves providing a library of genetic markers including RFLPs, AFLPs, isozymes, specific alleles and variable sequences, including SSRs. Such “libraries” are then screened against plant genomes. Once the genetic marker alleles of a plant have been identified, a link between the marker allele and a desired phenotypic trait can be determined.

CNSs and Gene Cloning

The CNSs of the invention can be utilized to clone genes from related species. For example, a CNS from rice and maize can be utilized to clone the corresponding gene from another monocot by procedures well known in the art and described below. Furthermore, a CNS from Arabidopsis and tomato can be utilized to clone the corresponding gene from another dicot by similar procedures. Such procedures are described in, for example, Ausubel, et al., Current Protocols in Molecular Biology, as cited in detail below.

CNS primers may also be used in related genomes to amplify a genomic fragment. The CNS primers may be used to capture genes, gene segments (exons) or non-coding regions. The CNS primers may also be used with non-CNS primers to amplify a genomic fragment. Thus, it is possible to capture sets of related genes in different species using the conserved primers. Also, alignment of syntenic regions which lie outside of coding regions may be performed. Conserved noncoding regions are useful for phylogenetic studies where evolutionarily divergent or convergent genes are being studied.

In addition, any of the cloning or amplification strategies described herein are useful for creating contigs of overlapping clones, thereby providing overlapping nucleic acids which show the physical relationship at the molecular level for genetically linked nucleic acids. A common example of this strategy is found in whole organism sequencing projects, in which overlapping clones are sequenced to provide the entire sequence of a chromosome. In this procedure, a library of the organism's cDNA or genomic DNA is made according to standard procedures described, e.g., in the references above. Individual clones are isolated and sequenced, and overlapping sequence information is ordered to provide the sequence of the organism. See also, Tomb, et al., Nature 388:539-547 (1997) describing the whole genome random sequencing and assembly of the complete genomic sequence of Helicobacter pylori, Fleischmann, et al., Science 269:496-512 (1995) describing whole genome random sequencing and assembly of the complete Haemophilus influenzae genome; Fraser, et al., Science 270:397-403 (1995) describing whole genome random sequencing and assembly of the complete Mycoplasma genitalium genome and Bult, et al., Science 273:1058-1073 (1996) describing whole genome random sequencing and assembly of the complete Methanococcus jannaschii genome. Recently, Hagiwara and Curtis, Nucleic Acids Res. 24(12): 2460-2461 (1996) developed a “long distance sequencer” PCR protocol for generating overlapping nucleic acids from very large clones to facilitate sequencing, and methods of amplifying and tagging the overlapping nucleic acids into suitable sequencing templates. The methods can be used in conjunction with shotgun sequencing techniques to improve the efficiency of shotgun methods typically used in whole organism sequencing projects. 
As applied to the present invention, the techniques are useful for identifying and sequencing genomic nucleic acids genetically linked to the QTLs as well as “candidate” genes responsible for QTL expression.

Cross-Species Alignment of DNA Sequences Using CNSs.

The CNSs of the invention can also be utilized for cross-species alignment of DNA sequences. In one format, a complete, annotated sequence of one individual, of one species such as rice is obtained. In another format, shotgun sequence of a rice-related species is obtained. Two sequences are said to be related if they have about the same genes in about the same order on the chromosome; e.g. corn shares this relationship to rice. Next, the sequences are virtually aligned using the complete sequence of rice to identify CNSs. Such CNSs are effective tools for alignment of sequences.

Identification of Genes that Encode RNA but Not Protein.

Any transcribed sequence that is not translated is an RNA gene. Ribosomal RNA subunits, spliceosomal RNAs, and other RNAs involved in gene silencing are well known cases. As with any other biologically important information, these RNA sequences will be conserved over evolutionary time, and will be identifiable as CNSs. Identification of genes encoding RNA is principally performed in two ways: 1) Computationally: RNAs are identified based on secondary structure, evolutionary conservation (CNS), or other sequence based methods (Eddy, 2002; Storz, 2002). 2) Experimentally: non-coding RNAs are physically isolated, purified, and sequenced (Lau et al., 2001; Lee and Ambros, 2001). In another format, reverse transcriptase PCR is utilized to determine if a CNS region is represented in the RNA pool (Bauer et al., 1994). This experimental verification step is needed because so little is known about genes that encode RNA only. Bauer, P, MD Crespi, J Szecsi, LA Allison, M Schultze, P Ratet, E Kondorosi and A Kondorosi, 1994. Alfalfa enod12 genes are differentially regulated during nodule development by Nod factors and Rhizobium invasion. Plant Physiol. 105: 585-592.

Identification of Functional Cis-Acting Sequences Within a Promoter

Functional cis-acting sequences can be identified both computationally and experimentally based on evolutionary conservation of the sequences utilizing the CNSs of the invention. Generally, the following steps are followed: 1) Genomic sequences from orthologous genes from two or more related organisms are obtained. These sequences can be found in public DNA sequence databases or can be generated by sequencing the gene of interest from the organism's genomic DNA. 2) The sequences are then compared with one of many sequence comparison tools. BLAST (Altschul et al., 1997; Kaplinsky et al., 2002) and Vista (Dubchak et al., 2000; Mayor et al., 2000) are two of many publicly available tools (Levy et al., 2001; Sumiyama et al., 2001). The coding regions (exons) will be conserved. Outside of the coding regions, any conserved noncoding sequences (CNS) represent potentially functional cis-acting sequences. 3) The function of the identified noncoding sequences is then tested. Typically, transcription factor DNA binding sites are identified by gel shift assays. After identifying the promoter regions, the promoter region sequences can be employed in double-stranded DNA arrays to identify molecules that affect the interactions of the transcription factors with their promoters (Bulyk et al. (1999) Nature Biotechnology 17:573-577). In addition, function can be tested by creating transgenic constructs carrying modified or non-modified cis-elements, or by identifying individuals with altered cis-elements using a reverse genetics approach. If the cis-elements are functional, changes to the cis-elements will be reflected in a difference in gene expression.
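Step 2 above can be illustrated with a deliberately simplified comparison. The sketch below does not reproduce BLAST or Vista; as a stand-in it finds exact shared words of length k between two orthologous sequences, discards any word that overlaps an annotated exon, and merges neighbouring hits into conserved blocks. The function name, the default word length and the exon coordinate convention (0-based, half-open intervals) are assumptions made for the example.

```python
def find_cns_candidates(seq_a, exons_a, seq_b, exons_b, k=12):
    """Return conserved noncoding blocks as (start_a, start_b, length).

    exons_a, exons_b: lists of (start, end) intervals, 0-based, end exclusive.
    Only exact, ungapped matches of at least k bases are reported.
    """
    def noncoding_words(seq, exons):
        index = {}
        for i in range(len(seq) - k + 1):
            # Skip any word that overlaps an annotated exon.
            if any(i < end and i + k > start for start, end in exons):
                continue
            index.setdefault(seq[i:i + k], []).append(i)
        return index

    words_a = noncoding_words(seq_a.upper(), exons_a)
    words_b = noncoding_words(seq_b.upper(), exons_b)
    hits = [(i, j)
            for word, positions in words_a.items()
            for i in positions
            for j in words_b.get(word, [])]

    # Merge overlapping or adjacent hits that lie on the same diagonal.
    blocks = []
    for i, j in sorted(hits, key=lambda hit: (hit[0] - hit[1], hit[0])):
        if blocks:
            a, b, length = blocks[-1]
            if a - b == i - j and i <= a + length:
                blocks[-1] = (a, b, max(length, i + k - a))
                continue
        blocks.append((i, j, k))
    return sorted(blocks)
```

Because only exact matches are used, this sketch would miss conserved sequences containing substitutions or small gaps, which a true alignment tool tolerates.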

Identification of Binding Sites for Transcription Factor Complexes

Transcription factor binding sites can be computationally identified in many ways, including the comparative methods described above. Other methods rely on correlating gene expression profiles with sequences or on looking for clusters of known transcription factor (TF) binding sites or CNSs. Methods include 1) acquiring genomic sequences (see above, step 1). A database of TF binding sequences is needed for most of the analysis, and many are publicly available (Ghosh, 2000; Heinemeyer et al., 1999; Heinemeyer et al., 1998; Higo et al., 1999; Lescot et al., 2002; Wingender et al., 1997). 2) Sequences are analyzed using one of many methods including gel shift assays. These can include locating clusters of previously identified TF binding sites (Markstein et al., 2002; Zhu et al., 2002). Another method involves combining frequency and positional information to predict transcription-factor binding sites (Kielbasa et al., 2001). A third method is identifying CNSs (see above). These CNSs are conserved because they are biologically important, and one possible biological function they could have is as a TF binding site.
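The cluster-based approach mentioned above can be sketched as a scan for known site patterns followed by a sliding-window count. This is an illustration under stated assumptions, not the method of the cited references: the function names, the window size and the minimum site count are hypothetical, and the motif patterns would in practice come from one of the binding-site databases cited above.

```python
import re


def find_sites(seq, motifs):
    """Return sorted (position, motif_name) hits.

    motifs: {name: pattern}, where each pattern is a plain sequence or a
    regular expression written by the user.
    """
    hits = []
    for name, pattern in motifs.items():
        # A lookahead is used so that overlapping sites are all reported.
        for match in re.finditer("(?=(%s))" % pattern, seq.upper()):
            hits.append((match.start(), name))
    return sorted(hits)


def site_clusters(hits, window=100, min_sites=3):
    """Return (first_pos, last_pos, n_sites) for windows rich in sites.

    A window is reported when at least min_sites site starts fall within
    `window` bases of the first site; windows that add no new site beyond
    one already reported are skipped.
    """
    positions = [pos for pos, _ in hits]
    clusters, last_end, j = [], -1, 0
    for i in range(len(positions)):
        j = max(j, i)
        while j < len(positions) and positions[j] - positions[i] <= window:
            j += 1
        if j - i >= min_sites and j > last_end:
            clusters.append((positions[i], positions[j - 1], j - i))
            last_end = j
    return clusters
```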

The CNSs may be utilized to physically capture DNA binding proteins. There are many methods used to physically capture DNA binding TFs available to those of skill in the art. Most of them are based on using a specific cis-regulatory segment of DNA to enrich total protein from an organism for TFs that bind specifically to the DNA used (Heinekamp et al., 2002; Honda et al., 2001).

Construction of Regulatory Networks and Pathways

Expression data or intimate knowledge of a promoter and its trans-regulatory factors can be used to construct regulatory networks of genetic function both computationally and experimentally using CNSs. 1) Computationally—RNA expression profiles from multiple experiments are analyzed and co-regulated sets of genes (see clustering) are identified (Toh and Horimoto, 2002; Wyrick and Young, 2002). Genes that are co-regulated are considered to be regulated by the same regulatory upstream genes. By reiteratively elucidating the regulation of the upstream genes, a regulatory network is constructed. 2) Experimentally—A promoter is analyzed to determine the regions it contains that are biologically important for regulation. Combined with knowledge about the trans factors that bind these cis regulatory sequences, a regulatory network can be elucidated (Davidson et al., 2002).

Clustering of Coregulated Gene Sets

Clustering of coregulated genes is performed by analyzing gene expression data sets using the CNSs of the invention. Generally, RNA or protein expression data is acquired from various conditions or tissues. Next, this data is analyzed using one of many available computational tools (Azuaje, 2002; Xu et al., 2002).
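The grouping step can be sketched with a simple correlation-based clustering. The example below is not the method of the cited tools and operates on expression values alone; the function names, the Pearson correlation threshold and the single-linkage grouping are assumptions chosen to keep the illustration short.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def coexpression_clusters(profiles, min_r=0.9):
    """Group genes whose profiles correlate at r >= min_r (single linkage).

    profiles: {gene_name: [expression value per condition or tissue]}
    Returns a sorted list of sorted gene-name lists.
    """
    genes = sorted(profiles)
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for index, gene in enumerate(genes):
        for other in genes[index + 1:]:
            if pearson(profiles[gene], profiles[other]) >= min_r:
                parent[find(other)] = find(gene)

    groups = {}
    for gene in genes:
        groups.setdefault(find(gene), []).append(gene)
    return sorted(groups.values())
```

The promoters and introns of genes falling in the same group would then be the natural place to look for shared CNSs.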

Linking Binding Sites to Appropriate Transcription Factors Via Gene Expression Profiles

Binding sites may be linked to appropriate transcription factors via gene expression profiles by 1) generating clusters of co-regulated genes using the methods described above; 2) identifying cis-elements in the promoters/introns of coregulated genes using any available computational method (Birnbaum et al., 2001; Bussemaker et al., 2001; Harmer et al., 2000); and 3) identifying the TFs that bind to these cis-elements by physically capturing them (above) or by searching binding site databases (for example TRANSFAC, ooTFD, and PLACE) (Ghosh, 2000; Heinemeyer et al., 1999; Heinemeyer et al., 1998; Higo et al., 1999; Wingender et al., 1997) for the TFs that correspond to cis-element sequences.

Phylogenetic Analysis of a Promoter.

Promoters may be analyzed by various methods. In a first method, a hybridization probe is required to identify the target sequence. This method is done individually for each species to isolate the related fragment to compare the promoter sequences. In step 1, the desired fragment to be cloned from the genomic DNA is identified (usually by a hybridization based method). In step 2, the genomic DNA of the corresponding size range is isolated. In step 3, the DNA is subcloned into a phage vector, for example, using the Stratagene Lambda Zap Express protocol available from Stratagene, San Diego, Calif. In step 4, the library is screened by hybridization for the correct insert. In step 5, the DNA is sequenced. In step 6, the sequence is compared with a reference sequence (possibly from a distantly related species). In step 7, BLAST or other algorithms are utilized to identify CNSs.

In a second method, inverse PCR is utilized. A small amount of sequence adjacent to the sought after sequence is required to design PCR primers. This method is done individually for each species to isolate the related fragment to compare the promoters. In step 1, the genomic DNA is digested with a restriction enzyme. In step 2, the DNA is ligated back together under conditions that favor intramolecular ligation (within the same piece of DNA). In step 3, the ligated circular DNA is used as template for a PCR reaction by amplifying with the primers facing outward (away from the known sequence). In step 4, it is confirmed that the amplified sequence is correct by using the amplified fragment in another round of PCR with nested primers. In step 5, the DNA is sequenced. In step 6, the sequences are compared with a reference sequence (possibly from a distantly related species). In step 7, BLAST or other algorithms are used to identify CNSs.

In a third method, one or more CNSs are utilized in a PCR reaction. In this method, the CNS is identified using the methods described above and is used to amplify the promoter region from any related species. Multiple species are analyzed in parallel so the analysis is much faster. In step 1, a primer from the identified CNS (facing downstream towards the exons) and an exon 1 conserved primer (facing upstream towards the promoter) are utilized in a PCR reaction. In step 2, the identity of the amplified fragment is confirmed as correct by hybridizing the amplified fragments with an exon 1 conserved probe. In step 3, the DNA is sequenced. In step 4, the sequences are compared with the other amplified sequences and the original two reference sequences.
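Step 1 of the third method can be illustrated by an in silico PCR in which a CNS primer faces downstream and an exon 1 primer faces upstream. The sketch is hypothetical: the function names and the example sequences are invented, and priming is modelled as an exact match, whereas a real reaction tolerates some mismatch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the predicted amplicon (top strand), or None if priming fails.

    forward_primer: CNS primer, identical to the top strand.
    reverse_primer: exon 1 primer, written 5' to 3' on the bottom strand.
    The first forward site and the first downstream reverse site are used.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]
```

Running the same primer pair over templates from several species and comparing the amplicon lengths mirrors the parallel analysis described above.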

Biochemical Techniques for Use of the CNSs

Various techniques available and known to those of ordinary skill in the art may be utilized in the techniques of the invention. Such techniques include the following:

(a) Gel Electrophoresis

Techniques for gel electrophoresis of DNA, e.g., following a CNS hybridization reaction, are well known in the art. See generally, CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, pp. 2.5.1-2.5.17, 5.4.1-5.4.4, 7.0.3-7.0.11, 7.6.1-7.6.9, 15.8.3 and 15.8.4-15.8.5 (Ausubel, et al., eds. John Wiley & Sons, 1994). Both polyacrylamide and agarose gel electrophoresis can be used to separate the selectively amplified DNA restriction fragments. The composition of the gel is chosen based on the degree of resolution that is needed. Agarose can separate DNA strands that are 50-100 nucleotides different in size, unless special materials are used. Acrylamide can routinely separate molecules which differ by 1-2 bases. Following electrophoresis, the fragments can be visualized by a number of staining techniques known in the art. For example, silver staining can be used to visualize DNA on a polyacrylamide gel. See BioTechniques 17(5):915 (1994). In a preferred embodiment, 4.5% polyacrylamide gels are fixed in 10% ethanol/0.5% acetic acid for 5-10 minutes. Gels are then incubated in 10% ethanol/0.5% acetic acid/0.25% silver nitrate for 5-10 minutes. Gels are then rinsed twice with deionized water for less than one minute. Gels are developed in 3% sodium hydroxide/1% formaldehyde until bands appear (5-10 minutes). Following this, gels are incubated in a fixing solution (10% ethanol/0.5% acetic acid) for 5 minutes and washed in deionized water for 10 minutes.

Bands can also be visualized by using fluorescent dNTPs during the PCR reaction. To visualize the bands the gel can be continuously exposed to UV light. Additional information on fluorescent labeling techniques is described, supra. Another technique is labeling one of the PCR primers with T4 kinase and P³³ or P³², exposing the gel to film, and marking the bands using pins which have been dipped in India ink prior to excision. This technique is useful when a single unique band is desired and high sensitivity is needed. It is, however, more tedious than silver staining for isolating all polymorphic DNA strands amplified with any given primer pair.

Each band visualized on the electrophoresis gel represents a population of DNA fragments of approximately the same size. In selecting DNA bands visualized on an electrophoresis gel that are unique to the population of interest (polymorphisms), visual comparison of DNA gel electrophoresis patterns is employed, using techniques known in the art. Polymorphisms useful as markers are selected on the basis of their visibility on the electrophoresis gel and their ability to reproducibly hybridize. Primer pairs are chosen for ability to amplify a large number of bands polymorphic between a heterogeneous set of inbreds and for amplification of few highly labeled monomorphic bands which can compete for nucleotides during the amplification process.

(b) Band Isolation and Identification

Individual bands visualized on an electrophoresis gel are cut out of the gel, e.g., using a scalpel, and amplified, e.g., using PCR, LCR, cloning, or the like. Using PCR as an example, the DNA is amplified by placing the gel piece directly into a reaction vessel containing the PCR reagents and appropriate CNS primers. Typically, a selective primer (e.g., a Plus 1 or Plus 3 primer) corresponding to that used in the CNS technique to produce the DNA fragments which were electrophoresed is used as a primer for the PCR amplification of the DNA in the gel band. In a preferred embodiment, band amplification reactions contain the band cut out of the gel, Plus 3 primer, deoxynucleotide triphosphates (dNTPs), Hot Tub or Taq polymerase (Amersham, Perkin Elmer or Boehringer Mannheim), and buffer. Bands are amplified using 5 cycles of 94° C. (30 s), 58° C. (30 s), and 72° C. (60 s); 5 cycles of 94° C. (30 s), 56° C. (30 s), and 72° C. (60 s); and 20 cycles of 94° C. (30 s), 50° C. (30 s), and 72° C. (60 s).

Variations in the exemplar amplification technique used will be readily apparent to one skilled in the art and several variations are set forth herein. For example, different polymerase enzymes can be used and the reaction conditions can be varied to optimize amplification of DNA contained in the gel band. Additional information on PCR amplification is described supra. Similarly, one of skill will be able to clone isolated DNAs, or amplified DNAs.

Amplification products are run on an agarose gel, e.g., preferably about a 1% agarose gel, to confirm successful amplification. If a band is seen, the products are optionally re-amplified, e.g., using the modified primers for Ligation Independent Cloning (Pharmingen). To modify primers, a 13 bp DNA segment complementary to the ends of the pPMG-LIC vector is added to the Plus 1 primer according to the manufacturer's instructions (Pharmingen). The product from the second amplification is checked on a 1% gel, and then purified using the Qiaquick PCR Purification Kit (Qiagen). The purified PCR products are then quantified, e.g., by reading Hoechst dye (bis-Benzamide) fluorescence with a Dynatech MicroFLUOR Reader.

More generally, in vitro amplification techniques suitable for amplifying sequences for use as molecular probes (e.g., from isolated, amplified or cloned AFLP fragments, or naturally occurring sequences which map to unique loci) or generating nucleic acid fragments for subsequent subcloning are available. Examples of techniques sufficient to direct persons of skill through such in vitro amplification methods, including the polymerase chain reaction (PCR), the ligase chain reaction (LCR), Qβ-replicase amplification and other RNA polymerase mediated techniques (e.g., NASBA) are found in Berger, Sambrook, and Ausubel, as well as Mullis et al., (1987) U.S. Pat. No. 4,683,202; PCR Protocols A Guide to Methods and Applications (Innis et al. eds) Academic Press Inc. San Diego, Calif. (1990) (Innis); Arnheim & Levinson (Oct. 1, 1990) C&EN 36-47; The Journal Of NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874; Lomeli et al. (1989) J. Clin. Chem 35, 1826; Landegren et al., (1988) Science 241, 1077-1080; Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace, (1989) Gene 4, 560; Barringer et al. (1990) Gene 89, 117, and Sooknanan and Malek (1995) Biotechnology 13: 563-564. Improved methods of cloning in vitro amplified nucleic acids are described in Wallace et al., U.S. Pat. No. 5,426,039. One of skill will appreciate that essentially any RNA can be converted into a double stranded DNA suitable for restriction digestion, PCR expansion and sequencing using reverse transcriptase and a polymerase. See, Ausubel, Sambrook and Berger, all supra. Further details on these procedures are found supra.

Any of these amplification techniques can also be used to generate amplified mixtures of DNA (i.e., from which amplified bands are isolated, or against which probes are hybridized).

(c) Cloning Isolated Bands

Cloning methodologies for cloning DNAs from CNS amplified gel bands (or amplicons of such bands), and for replicating nucleic acids useful as probes, as well as sequencing methods to verify the sequence of cloned nucleic acids are well known in the art. Examples of appropriate cloning and sequencing techniques, and instructions sufficient to direct persons of skill through many cloning exercises are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology volume 152 Academic Press, Inc., San Diego, Calif. (Berger); Sambrook et al. (1989) Molecular Cloning—A Laboratory Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor Press, N.Y., (Sambrook); and Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (through and including the 1997 Supplement) (Ausubel). A catalogue of Bacteria and Bacteriophages useful for cloning is provided, e.g., by the ATCC, e.g., The ATCC Catalogue of Bacteria and Bacteriophage (1992) Gherna et al. (eds) published by the ATCC. Additional basic procedures for sequencing, cloning and other aspects of molecular biology and underlying theoretical considerations are also found in Lewin (1995) Genes V Oxford University Press Inc., New York (Lewin); and Watson et al. (1992) Recombinant DNA Second Edition Scientific American Books, NY.

Most DNA sequencing today is carried out by chain termination methods of DNA sequencing. The most popular chain termination methods of DNA sequencing are variants of the dideoxynucleotide mediated chain termination method of Sanger. See, Sanger et al. (1977) Proc. Nat. Acad. Sci., USA 74:5463-5467. For a simple introduction to dideoxy sequencing, see, Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (Supplement 37, current through 1997) (Ausubel), Chapter 7. Thousands of laboratories employ dideoxynucleotide chain termination techniques. Commercial kits containing the reagents most typically used for these methods of DNA sequencing are available and widely used.

(d) Amplification Detection Strategies

In a preferred embodiment, a polymorphic nucleotide is detected by amplifying the polymorphic nucleotide and detecting the resulting amplicon using the CNSs of the invention. A variety of variations on this strategy are used to detect polymorphic nucleic acids, depending on the materials available, and the like. In typical cases, a biological nucleic acid is amplified. Example biological nucleic acids are derived, e.g., from cDNA, genomic DNA isolated from a plant or microbe, genomic DNA isolated from a plant extract or microbial extract, genomic DNA isolated from an isolated plant tissue, genomic DNA isolated from an isolated plant tissue extract, genomic DNA isolated from a plant cell culture, genomic DNA isolated from a plant cell culture extract, genomic DNA isolated from a recombinant cell comprising a nucleic acid derived from a plant, genomic DNA isolated from a plant seed, genomic DNA isolated from an extract of a recombinant plant cell comprising a nucleic acid derived from a plant, genomic DNA isolated from an animal, genomic DNA isolated from an animal extract, genomic DNA isolated from an isolated animal tissue, genomic DNA isolated from an isolated animal tissue extract, genomic DNA isolated from an animal cell culture, genomic DNA isolated from an animal cell culture extract, genomic DNA isolated from a recombinant animal cell comprising a nucleic acid derived from an animal, genomic DNA isolated from an animal egg, genomic DNA isolated from an extract of a recombinant animal cell. Certain types of sources are preferred, depending on the application. For example, plant tissues or seeds are preferred for performing marker assisted selection of crops. Animal tissues are preferred for performing marker assisted selection of animals.

In one embodiment, nucleic acid primers which hybridize to regions of a genomic nucleic acid that flank a polymorphic nucleotide to be detected are used in PCR, LCR, or other amplification reactions to generate an amplicon comprising a polymorphic nucleotide to be detected. A variety of other PCR and LCR strategies are known in the art and are found in Berger, Sambrook, Ausubel, and Innis, all supra. See also Mullis et al., (1987) U.S. Pat. No. 4,683,202, U.S. Pat. No. 4,683,195, PCR TECHNOLOGY 1-31 (Henry A. Erlich ed., Stockton Press 1989). In brief, a nucleic acid having a polymorphic nucleic acid to be detected (a genomic DNA, a genomic clone, a genomic amplicon, a cDNA, or the like) is hybridized to primers which flank the polymorphic nucleotide to be detected (e.g., nucleotide polymorphisms). As discussed, amplification mixtures are also appropriate, in which several amplicons are simultaneously queried in a given assay. Detection is typically performed by hybridizing amplification reaction products to a selected probe, or to multiple probes as described supra. Alternatively, an acrylamide or agarose gel can be used to size separate reaction products (although this decreases throughput, and is, therefore, often undesirable); the products can be detected by allele-specific hybridization, by allele-specific hybridization to a polymer array as described supra, or by sequencing the PCR amplicons (using standard Sanger dideoxy or Maxam-Gilbert methods). Amplicons are optionally cloned or sequenced by any of a variety of protocols as described supra for bands isolated from AFLP gels.

Once an amplicon is sequenced, the sequence is optionally used to select primers complementary to the amplicon, i.e., primers which will hybridize to the amplicon. It is expected that one of skill is thoroughly familiar with the theory and practice of nucleic acid hybridization and primer selection. Gait, ed. Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford (1984); W. H. A. Kuijpers Nucleic Acids Research 18(17), 5197 (1994); K. L. Dueholm J. Org. Chem. 59, 5767-5773 (1994); S. Agrawal (ed.) Methods in Molecular Biology, volume 20; and Tijssen (1993) Laboratory Techniques in biochemistry and molecular biology-hybridization with nucleic acid probes, e.g., part I chapter 2 “overview of principles of hybridization and the strategy of nucleic acid probe assays”, Elsevier, New York provide a basic guide to nucleic acid hybridization. Innis, supra, provides an overview of primer selection.

Primer sequences are optionally selected to hybridize only to a perfectly complementary DNA, with the nearest mismatch hybridization possibility from known DNA sequence typically having at least about 50 to 70% hybridization mismatches, and preferably 100% mismatches for the terminal 5 nucleotides at the 3′ end of the primer.

PCR primers are optionally selected so that no secondary structure forms within the primer. Self-complementary primers have poor hybridization properties, because the complementary portions of the primers self hybridize (i.e., form hairpin structures). Primers are selected to have minimal cross-hybridization, thereby preventing competition between individual primers and a template nucleic acid and preventing duplex formation of the primers in solution, and possible concatenation of the primers during PCR. If there is more than one constant region in the primer, the constant regions of the primer are selected so that they do not self-hybridize or form hairpin structures.
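Two of the selection criteria above, avoiding hairpins within a primer and avoiding duplexes between primers, can be screened with very simple string checks. The sketch below is an assumption-laden illustration: the function names, the stem length, the minimum loop size and the use of perfect complementarity only (no thermodynamic scoring) are choices made for the example and are not taken from any particular primer design program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_hairpin(primer, stem=4, min_loop=3):
    """True if a stretch of `stem` bases can pair with a later stretch of the
    same primer, leaving at least `min_loop` unpaired bases in between."""
    p = primer.upper()
    for i in range(len(p) - 2 * stem - min_loop + 1):
        partner = reverse_complement(p[i:i + stem])
        if p.find(partner, i + stem + min_loop) != -1:
            return True
    return False


def three_prime_dimer(primer_a, primer_b, n=5):
    """True if the last n bases of primer_a can pair perfectly anywhere along
    primer_b, which would let the polymerase extend a primer duplex."""
    tail = primer_a.upper()[-n:]
    return reverse_complement(tail) in primer_b.upper()
```

A candidate primer pair would be rejected if either primer reports a hairpin, or if either 3′ end can pair with the other primer or with itself.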

One of skill will recognize that there are a variety of possible ways of performing the above selection steps, and that variations on the steps are appropriate. Most typically, selection steps are performed using simple computer programs to perform the selection as outlined above; however, all of the steps are optionally performed manually. One available computer program for primer selection is the MacVector™ program from Kodak. In addition to programs for primer selection, one of skill can easily design simple programs for any or all of the preferred selection steps.

One of skill will recognize that a wide variety of amplicons are provided by the present invention. In particular, amplicons are generated with primers flanking polymorphic nucleic acids which are identified by the methods herein. The amplicons can be generated by exponential amplification as described in the examples herein, or by linear amplification using a single specific primer, or by using one of the example primers below in conjunction with a set of random primers.

It will be appreciated that amplicons are characterized by a variety of physicochemical properties, including, but not limited to the following. First, the amplicons of the invention are produced in an amplification reaction using the primers as described above, with genomic or cDNA nucleic acid as a template (or a derivative thereof, such as a cloned or in vitro amplified genomic or cDNA nucleic acid). Second, single stranded forms of the amplicons (e.g., denatured amplicons) hybridize under stringent conditions to marker nucleic acids. Conditions for specific hybridization of nucleic acids, including amplicon nucleic acids are described above. A third physicochemical property of amplicons of the invention is that they specifically hybridize to one or more of the AFLP fragments identified using the methods herein.

In another embodiment, LCR is used to amplify a polymorphic nucleic acid or a mixture of polymorphic nucleic acids. By detecting the amplification product, presence of the polymorphic nucleotide is confirmed. Detection is typically performed by hybridizing LCR reaction products to a marker probe; alternatively, LCR products can be run on an acrylamide or agarose gel and the size of the reaction products detected (although this decreases throughput in some applications, and is, therefore, often undesirable), or the products can be detected by allele-specific hybridization, by allele-specific hybridization to a polymer array as described supra, or by sequencing the LCR amplicons (using standard Sanger dideoxy or Maxam-Gilbert methods). Detection techniques such as PCR amplification or other in vitro amplification methods are also used to detect LCR products.

The ligation chain reaction (LCR; sometimes denoted the “ligation amplification reaction” or “LAR”) and related techniques are used as diagnostic methods for detecting single nucleotide variations in target nucleic acids. LCR provides a mechanism for linear or exponential amplification of a target nucleic acid, or a mixture of DNAs comprising a target nucleic acid, via ligation of complementary oligonucleotides hybridized to a target. This amplification is performed to distinguish target nucleic acids that differ by a single nucleotide, providing a powerful tool for the analysis of genetic variation in the present invention, i.e., for distinguishing polymorphic nucleotides.

The principle underlying LCR is straightforward: Oligonucleotides which are complementary to adjacent segments of a target nucleic acid are brought into proximity by hybridization to the target, and ligated using a ligase. To achieve linear amplification of the nucleic acid, a single pair of CNS oligonucleotides which hybridize to adjoining areas of the target sequence are employed: the oligonucleotides are ligated, denatured from the template and the reaction is repeated. To achieve exponential amplification of the target nucleic acid two pairs of CNS oligonucleotides (or more) are used, each pair hybridizing to complementary sequences on e.g., a double-stranded target polynucleotide. After ligation and denaturation, the target and each of the ligated oligonucleotide pairs serve as templates for hybridization of the complementary oligonucleotides to achieve ligation. The ligase enzyme used in performing LCR is typically thermostable, allowing for repeated denaturation of the template and ligated oligonucleotide complex by heating the ligation reaction. To amplify a mixture of nucleic acids, multiple primers are used (e.g., random primers, or primers comprising arbitrary nucleotides as described supra).
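The difference between linear and exponential LCR can be illustrated with an idealized yield calculation. The sketch below assumes a fixed per-cycle ligation efficiency and unlimited oligonucleotides; the function names and the efficiency parameter are assumptions introduced for the example and do not describe measured behavior.

```python
# Idealized LCR yield model. Names and the efficiency parameter are assumptions.


def lcr_linear_products(template_strands: float, cycles: int,
                        efficiency: float = 1.0) -> float:
    """One oligonucleotide pair: only the original strands act as templates,
    so ligated product accumulates by a constant amount each cycle."""
    return template_strands * efficiency * cycles


def lcr_exponential_products(template_strands: float, cycles: int,
                             efficiency: float = 1.0) -> float:
    """Two complementary pairs: every ligated product becomes a template for
    the other pair, so the template pool grows by (1 + efficiency) per cycle."""
    total_templates = template_strands * (1.0 + efficiency) ** cycles
    return total_templates - template_strands  # ligated products only
```

Under these assumptions, 100 template strands yield 2,000 ligated products after 20 linear cycles, whereas 10 exponential cycles at full efficiency already yield 102,300.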

LCR is useful as a diagnostic tool in the detection of genetic variation. Using LCR methods, it is possible to distinguish between target polynucleotides which differ by a single nucleotide at the site of ligation. Ligation occurs only between oligonucleotides hybridized to a target polynucleotide where the complementarity between the oligonucleotides and the target is perfect, enabling differentiation between allelic variants of a gene or other chromosomal sequence. The specificity of ligation during LCR can be increased by substituting the more specific NAD⁺-dependent ligases such as E. coli ligase and (thermostable) Taq ligase for the less specific T4 DNA ligase. The use of NAD analogues in the ligation reaction further increases specificity of the ligation reaction. See, U.S. Pat. No. 5,508,179 to Wallace et al.
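The allele discrimination described above can be modeled in a simplified way: two oligonucleotides are treated as ligatable only if, joined end to end, they are perfectly complementary to a contiguous stretch of the target. The sketch below is an idealization under that assumption (real ligases tolerate some mismatches away from the junction), and the function names are hypothetical.

```python
# Idealized ligation check. Names are assumptions for illustration.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def would_ligate(target: str, upstream: str, downstream: str) -> bool:
    """True if the two oligonucleotides (each written 5' to 3') anneal
    perfectly and immediately adjacent to one another on the target strand.

    The joined oligonucleotides must equal a contiguous segment of the
    target's reverse complement; a gap or a mismatch prevents ligation."""
    joined = (upstream + downstream).upper()
    return joined in reverse_complement(target)
```

For example, oligonucleotides designed against one allele return True for that allele and False for a target that differs by a single nucleotide at the ligation junction.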

Finally, multiple LCR reactions can be run simultaneously in a single reaction, or in parallel reactions for simultaneous detection of any or all of the nucleotide polymorphisms described herein.

Nucleotide polymorphisms are also detected using other in vitro detection methods, including TAS, 3SR and Qβ amplification. The transcription-based amplification system (TAS), the self-sustained sequence replication system (3SR) and the Qβ replicase amplification system (QB) are reviewed in The Journal Of NIH Research (1991) 3, 81-94. The present invention may be practiced in conjunction with TAS (Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173) or the related 3SR (Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874) for detecting single-base alterations in target nucleic acids by transcribing the target, annealing oligonucleotide primers to the transcript and ligating the annealed primers. QB replication (Lomeli et al. (1989) J. Clin. Chem 35, 1826) may also be used in conjunction with the ligation methods of the present invention to detect mismatches by performing QB amplification on DNA ligated by the methods of the present invention.

(d) Labeling and Detecting Probes

DNA from a CNS amplification reaction band can be amplified and labeled in several ways. The DNA that is labeled can come from the following sources: 1) Amplification product using DNA from a gel piece as template, 2) Amplification product from an amplification where the template is a plasmid containing a CNS amplification band as an insert, 3) Plasmid DNA from a clone that contains a CNS amplification band as an insert, and 4) Oligonucleotide synthesis of subsequences of a CNS amplification band.

Several preferred methods can be used to label and detect the DNA from a CNS amplification band, including: 1) Chemiluminescence [using Horseradish Peroxidase and/or Alkaline Phosphatase with substrates that produce photons as breakdown products] [kits available from Amersham, Boehringer-Mannheim, and Life Technologies/Gibco BRL], 2) Color production [using Horseradish Peroxidase and/or Alkaline Phosphatase with substrates that produce a colored precipitate] [kits available from Life Technologies/Gibco BRL, and Boehringer-Mannheim], 3) Chemifluorescence using Alkaline Phosphatase and the substrate AttoPhos [Amersham] or other substrates that produce fluorescent products, 4) Fluorescence [using Cy-5 [Amersham], fluorescein, and other fluorescent tags], 5) Radioactivity using end-labeling, nick translation, random priming, or PCR to incorporate radioactive molecules into the probe DNA/oligonucleotide. Other methods for labeling and detection will be readily apparent to one skilled in the art.

Amplification products are preferably diluted 1/10, resulting in a concentration of approximately 10 ng DNA/ml. The oligo probes are diluted to a concentration of approximately 1.5 mM. The amount of these dilutions required to easily detect products is 1 μL of dilution for every 1 ml of hybridization solution used. Labeling methods such as the ECL Direct Nucleic Acid Labeling and Detection System (Amersham Corporation, 2636 Clearbrook Drive, Arlington Heights, Ill.), a horseradish peroxidase chemiluminescence system, can be used for labeling and detecting amplified bands. Methods of labeling oligonucleotides, such as the alkaline phosphatase (AP) system (E-Link kit Oligonucleotide Conjugation Kit from Genosys, Europe), can also be used.

More generally, a probe for use in an in situ detection procedure, an in vitro amplification procedure (PCR, LCR, NASBA, etc.), hybridization techniques (allele-specific hybridization, in situ analysis, Southern analysis, northern analysis, etc.) or any other detection procedure herein, including CNS amplification fragments, can be labeled with any composition detectable by spectroscopic, photochemical, biochemical, immunochemical, electrical, optical or chemical means. Useful labels in the present invention include spectral labels such as fluorescent dyes (e.g., fluorescein isothiocyanate, Texas red, rhodamine, digoxigenin, biotin, and the like), radiolabels (e.g., ³H, ¹²⁵I, ³⁵S, ¹⁴C, ³²P, ³³P, etc.), enzymes (e.g., horseradish peroxidase, alkaline phosphatase, etc.), and spectral colorimetric labels such as colloidal gold or colored glass or plastic (e.g., polystyrene, polypropylene, latex, etc.) beads. The label may be coupled directly or indirectly to a component of the detection assay (e.g., a probe, primer, isolated DNA, amplicon, YAC, BAC or the like) according to methods well known in the art. As indicated above, a wide variety of labels may be used, with the choice of label depending on sensitivity required, ease of conjugation with the compound, stability requirements, available instrumentation, and disposal provisions. In general, a detector which monitors a probe-target nucleic acid hybridization is adapted to the particular label which is used. Typical detectors include spectrophotometers, phototubes and photodiodes, microscopes, scintillation counters, cameras, film and the like, as well as combinations thereof. Examples of suitable detectors are widely available from a variety of commercial sources known to persons of skill. Commonly, an optical image of a substrate comprising a nucleic acid array with a particular set of probes bound to the array is digitized for subsequent computer analysis.

Because incorporation of radiolabeled nucleotides into nucleic acids is straightforward, this detection represents a preferred labeling strategy. Exemplary technologies for incorporating radiolabels include end-labeling with a kinase or phosphatase enzyme, nick translation, incorporation of radioactive nucleotides with a polymerase and many other well known strategies.

Fluorescent labels are also preferred labels, having the advantage of requiring fewer precautions in handling, and being amenable to high-throughput visualization techniques. Preferred labels are typically characterized by one or more of the following: high sensitivity, high stability, low background, low environmental sensitivity and high specificity in labeling. Fluorescent moieties, which are incorporated into the labels of the invention, are generally known, including Texas red, digoxigenin, biotin, 1- and 2-aminonaphthalene, p,p′-diaminostilbenes, pyrenes, quaternary phenanthridine salts, 9-aminoacridines, p,p′-diaminobenzophenone imines, anthracenes, oxacarbocyanine, merocyanine, 3-aminoequilenin, perylene, bis-benzoxazole, bis-p-oxazolyl benzene, 1,2-benzophenazin, retinol, bis-3-aminopyridinium salts, hellebrigenin, tetracycline, sterophenol, benzimidazolylphenylamine, 2-oxo-3-chromen, indole, xanthen, 7-hydroxycoumarin, phenoxazine, salicylate, strophanthidin, porphyrins, triarylmethanes and flavin.
Individual fluorescent compounds which have functionalities for linking to an element desirably detected in an apparatus or assay of the invention, or which can be modified to incorporate such functionalities include, e.g., dansyl chloride; fluoresceins such as 3,6-dihydroxy-9-phenylxanthydrol; rhodamineisothiocyanate; N-phenyl 1-amino-8-sulfonatonaphthalene; N-phenyl 2-amino-6-sulfonatonaphthalene; 4-acetamido-4-isothiocyanato-stilbene-2,2′-disulfonic acid; pyrene-3-sulfonic acid; 2-toluidinonaphthalene-6-sulfonate; N-phenyl-N-methyl-2-aminonaphthalene-6-sulfonate; ethidium bromide; stebrine; auromine-0; 2-(9′-anthroyl)palmitate; dansyl phosphatidylethanolamine; N,N′-dioctadecyl oxacarbocyanine; N,N′-dihexyl oxacarbocyanine; merocyanine; 4-(3′-pyrenyl)stearate; d-3-aminodesoxy-equilenin; 12-(9′-anthroyl)stearate; 2-methylanthracene; 9-vinylanthracene; 2,2′(vinylene-p-phenylene)bisbenzoxazole; p-bis(2-(4-methyl-5-phenyl-oxazolyl))benzene; 6-dimethylamino-1,2-benzophenazin; retinol; bis(3′-aminopyridinium) 1,10-decandiyl diiodide; sulfonaphthylhydrazone of hellebrigenin; chlorotetracycline; N-(7-dimethylamino-4-methyl-2-oxo-3-chromenyl)maleimide; N-(p-(2-benzimidazolyl)-phenyl)maleimide; N-(4-fluoranthyl)maleimide; bis(homovanillic acid); resazurin; 4-chloro-7-nitro-2,1,3-benzooxadiazole; merocyanine 540; resorufin; rose bengal; and 2,4-diphenyl-3(2H)-furanone. Many fluorescent tags are commercially available from SIGMA chemical company (Saint Louis, Mo.), Molecular Probes, R&D systems (Minneapolis, Minn.), Pharmacia LKB Biotechnology (Piscataway, N.J.), CLONTECH Laboratories, Inc. (Palo Alto, Calif.), Chem Genes Corp., Aldrich Chemical Company (Milwaukee, Wis.), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersburg, Md.), Fluka Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs, Switzerland), and Applied Biosystems (Foster City, Calif.) as well as other commercial sources known to one of skill.

In one embodiment, nucleic acids are labeled by culturing recombinant cells which encode the nucleic acid in a growth medium containing fluorescent or radioactive nucleotide analogues, resulting in the production of labeled nucleic acids. Similarly, nucleic acids are synthesized in vitro using a primer and a DNA polymerase such as Taq. For example, Hawkins et al. U.S. Pat. No. 5,525,711 describes pteridine nucleotide analogs for use in fluorescent DNA probes, including PCR amplicons.

The label is coupled directly or indirectly to a molecule to be detected (a product, substrate, enzyme, or the like) according to methods well known in the art. As indicated above, a wide variety of labels are used, with the choice of label depending on the sensitivity required, ease of conjugation of the compound, stability requirements, available instrumentation, and disposal provisions. Nonradioactive labels are often attached by indirect means. Generally, a ligand molecule (e.g., biotin) is covalently bound to a nucleic acid such as a probe, primer, amplicon, YAC, BAC or the like. The ligand then binds to an anti-ligand (e.g., streptavidin) molecule which is either inherently detectable or covalently bound to a signal system, such as a detectable enzyme, a fluorescent compound, or a chemiluminescent compound. A number of ligands and anti-ligands can be used. Where a ligand has a natural anti-ligand, for example, biotin, thyroxine, and cortisol, it can be used in conjunction with labeled anti-ligands. Alternatively, any haptenic or antigenic compound can be used in combination with an antibody. Labels can also be conjugated directly to signal generating compounds, e.g., by conjugation with an enzyme or fluorophore or chromophore. Enzymes of interest as labels will primarily be hydrolases, particularly phosphatases, esterases and glycosidases, or oxidoreductases, particularly peroxidases. Fluorescent compounds include fluorescein and its derivatives, rhodamine and its derivatives, dansyl, umbelliferone, etc. Chemiluminescent compounds include luciferin, and 2,3-dihydrophthalazinediones, e.g., luminol. Means of detecting labels are well known to those of skill in the art. Thus, for example, where the label is a radioactive label, means for detection include a scintillation counter or photographic film as in autoradiography.
Where the label is optically detectable, typical detectors include microscopes, cameras, phototubes and photodiodes and many other detection systems which are widely available.

(e) Evaluation of DNA Fragments Isolated from Gel

Each DNA gel band that has been amplified and cloned is evaluated for its utility as a polynucleotide probe in a dot blot hybridization assay, or one of the other assays described herein. Two parameters are often evaluated: the ability of each potential probe to hybridize to (1) a set of fingerprinting inbred plants amplified with the appropriate Plus 3 primers, that result in a specific pattern of positives and negatives and (2) the amplified DNA from the gel band used to generate the polynucleotide probe. This hybridization can be evaluated using the dot blot hybridization procedure described in this specification. A testing membrane is created by separately amplifying DNA from the fingerprinting inbreds and making a hand dot blot with (1) the amplification products from a single primer pair and (2) the amplified DNA gel bands isolated from a single primer pair. (1) and (2) are immobilized onto a membrane. An amplified band that specifically recognizes itself (i.e., the band from which the probe was isolated or designed) and specifically hybridizes to the fingerprinting set of inbreds is considered useful as a polynucleotide probe for dot blots in the method of the invention. These bands are considered dominant markers. Bands that when hybridized to the testing membranes produce more positive inbreds than expected and recognize two or more band dots can be evaluated to determine if the additional positive bands belong to a co-dominant marker. Co-dominant markers are not useful in the method of the invention as polynucleotide probes for dot blot assays unless specific oligonucleotide sequences are designed for each allele.
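The evaluation logic described above can be summarized as a small decision rule. The following sketch is illustrative only; the function name, the argument names and the returned category strings are assumptions, and scoring of dots as positive or negative is taken as already done.

```python
# Illustrative dot blot evaluation rule. All names are assumptions.


def classify_probe(expected_inbreds: set, positive_inbreds: set,
                   own_band: str, positive_bands: set) -> str:
    """Classify a candidate probe from its dot blot results.

    expected_inbreds: fingerprinting inbreds expected to be positive
    positive_inbreds: fingerprinting inbreds actually scored positive
    own_band:         the gel band from which the probe was derived
    positive_bands:   gel band dots scored positive with the probe
    """
    if own_band not in positive_bands:
        return "not useful"  # the probe fails to recognize itself
    if positive_inbreds == expected_inbreds and positive_bands == {own_band}:
        return "dominant marker"
    if positive_inbreds > expected_inbreds and len(positive_bands) >= 2:
        # More positives than expected and more than one band recognized.
        return "possible co-dominant marker"
    return "not useful"
```

A probe returned as a possible co-dominant marker would, per the text above, be examined further and used only if allele-specific oligonucleotides are designed.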

Sources of CNSs

The CNSs of the invention can be isolated from any organism including yeast, bacteria, animals and plants.

Preferred sources from which the plant CNSs of the invention can be obtained or isolated include, but are not limited to, corn (Zea mays), Brassica sp. (e.g., B. napus, B. rapa, B. juncea), particularly those Brassica species useful as sources of seed oil, alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), millet (e.g., pearl millet (Pennisetum glaucum), proso millet (Panicum miliaceum), foxtail millet (Setaria italica), finger millet (Eleusine coracana)), sunflower (Helianthus annuus), safflower (Carthamus tinctorius), wheat (Triticum aestivum), soybean (Glycine max), tobacco (Nicotiana tabacum), potato (Solanum tuberosum), peanuts (Arachis hypogaea), cotton (Gossypium barbadense, Gossypium hirsutum), sweet potato (Ipomoea batatas), cassava (Manihot esculenta), coffee (Coffea spp.), coconut (Cocos nucifera), pineapple (Ananas comosus), citrus trees (Citrus spp.), cocoa (Theobroma cacao), tea (Camellia sinensis), banana (Musa spp.), avocado (Persea americana), fig (Ficus carica), guava (Psidium guajava), mango (Mangifera indica), olive (Olea europaea), papaya (Carica papaya), cashew (Anacardium occidentale), macadamia (Macadamia integrifolia), almond (Prunus amygdalus), sugar beets (Beta vulgaris), sugarcane (Saccharum spp.), rye, rape, canola, millet, oats, duckweed (Lemna), barley, vegetables, ornamentals, and conifers.

Duckweed (Lemna, see WO 00/07210) includes members of the family Lemnaceae. There are four known genera and 34 species of duckweed, as follows: genus Lemna (L. aequinoctialis, L. disperma, L. ecuadoriensis, L. gibba, L. japonica, L. minor, L. miniscula, L. obscura, L. perpusilla, L. tenera, L. trisulca, L. turionifera, L. valdiviana); genus Spirodela (S. intermedia, S. polyrrhiza, S. punctata); genus Wolffia (Wa. angusta, Wa. arrhiza, Wa. australina, Wa. borealis, Wa. brasiliensis, Wa. columbiana, Wa. elongata, Wa. globosa, Wa. microscopica, Wa. neglecta); and genus Wolffiella (Wl. caudata, Wl. denticulata, Wl. gladiata, Wl. hyalina, Wl. lingulata, Wl. repunda, Wl. rotunda, and Wl. neotropica). Any other genera or species of Lemnaceae, if they exist, are also aspects of the present invention. Lemna gibba, Lemna minor, and Lemna miniscula are preferred, with Lemna minor and Lemna miniscula being most preferred. Lemna species can be classified using the taxonomic scheme described by Landolt, Biosystematic Investigation on the Family of Duckweeds: The family of Lemnaceae—A Monograph Study. Geobotanischen Institut ETH, Stiftung Rubel, Zurich (1986).

Vegetables from which to obtain or isolate the CNS of the invention include, but are not limited to, tomatoes (Lycopersicon esculentum), lettuce (e.g., Lactuca sativa), green beans (Phaseolus vulgaris), lima beans (Phaseolus limensis), peas (Lathyrus spp.), and members of the genus Cucumis such as cucumber (C. sativus), cantaloupe (C. cantalupensis), and musk melon (C. melo). Ornamentals from which to obtain or isolate the CNS of the invention include, but are not limited to, azalea (Rhododendron spp.), hydrangea (Hydrangea macrophylla), hibiscus (Hibiscus rosa-sinensis), roses (Rosa spp.), tulips (Tulipa spp.), daffodils (Narcissus spp.), petunias (Petunia hybrida), carnation (Dianthus caryophyllus), poinsettia (Euphorbia pulcherrima), and chrysanthemum. Conifers that may be employed in practicing the present invention include, for example, pines such as loblolly pine (Pinus taeda), slash pine (Pinus elliotii), ponderosa pine (Pinus ponderosa), lodgepole pine (Pinus contorta), and Monterey pine (Pinus radiata); Douglas-fir (Pseudotsuga menziesii); Western hemlock (Tsuga canadensis); Sitka spruce (Picea glauca); redwood (Sequoia sempervirens); true firs such as silver fir (Abies amabilis) and balsam fir (Abies balsamea); and cedars such as Western red cedar (Thuja plicata) and Alaska yellow-cedar (Chamaecyparis nootkatensis). Leguminous plants from which the CNS of the invention can be isolated or obtained include, but are not limited to, beans and peas. Beans include guar, locust bean, fenugreek, soybean, garden beans, cowpea, mungbean, lima bean, fava bean, lentils, chickpea, and the like. Legumes include, but are not limited to, Arachis, e.g., peanuts, Vicia, e.g., crown vetch, hairy vetch, adzuki bean, mung bean, and chickpea, Lupinus, e.g., lupine, Trifolium, Phaseolus, e.g., common bean and lima bean, Pisum, e.g., field bean, Melilotus, e.g., clover, Medicago, e.g., alfalfa, Lotus, e.g., trefoil, Lens, e.g., lentil, and false indigo.

The plant CNSs of the invention can be isolated from papaya, garlic, pea, peach, pepper, petunia, strawberry, sorghum, sweet potato, turnip, safflower, corn, pea, endive, gourd, grape, snap bean, chicory, cotton, tobacco, aubergine, beet, buckwheat, broad bean, nectarine, avocado, mango, banana, groundnut, potato, peanut, lettuce, pineapple, spinach, squash, sugarbeet, sugarcane, sweet corn, chrysanthemum.

Other preferred sources of the CNS of the invention include Acacia, aneth, artichoke, arugula, blackberry, canola, cilantro, clementines, escarole, eucalyptus, fennel, grapefruit, honey dew, jicama, kiwifruit, lemon, lime, mushroom, nut, okra, orange, parsley, persimmon, plantain, pomegranate, poplar, radiata pine, radicchio, Southern pine, sweetgum, tangerine, triticale, vine, yams, apple, pear, quince, cherry, apricot, melon, hemp, buckwheat, grape, raspberry, chenopodium, blueberry, nectarine, peach, plum, strawberry, watermelon, eggplant, pepper, cauliflower, Brassica, e.g., broccoli, cabbage, Brussels sprouts, onion, carrot, leek, beet, broad bean, celery, radish, pumpkin, endive, gourd, garlic, snapbean, spinach, squash, turnip, ultilane, and zucchini.

Yet other sources of plant CNSs are ornamental plants including, but not limited to, Impatiens, Begonia, Pelargonium, Viola, Cyclamen, Verbena, Vinca, Tagetes, Primula, Saintpaulia, Ageratum, Amaranthus, Antirrhinum, Aquilegia, Cineraria, Clover, Cosmos, Cowpea, Dahlia, Datura, Delphinium, Gerbera, Gladiolus, Gloxinia, Hippeastrum, Mesembryanthemum, Salpiglossis, and Zinnia, and other plants. Preferred forage and turf grass nucleic acid sources for the nucleic acid molecules of the invention include, but are not limited to, alfalfa, orchard grass, tall fescue, perennial ryegrass, creeping bent grass, and redtop.

Transgenic Plants of the Invention and Methods of Preparation

In certain circumstances, it is desirable to test, analyze or utilize the CNSs of the present invention in a transgenic plant. As discussed above, several assays (transcription factor identification, etc.) may be performed in transgenic plants. Plant species may be transformed with the DNA constructs of the present invention by the DNA-mediated transformation of plant cell protoplasts and subsequent regeneration of the plant from the transformed protoplasts in accordance with procedures well known in the art.

Any plant tissue capable of subsequent clonal propagation, whether by organogenesis or embryogenesis, may be transformed with a vector of the present invention. The term “organogenesis,” as used herein, means a process by which shoots and roots are developed sequentially from meristematic centers; the term “embryogenesis,” as used herein, means a process by which shoots and roots develop together in a concerted fashion (not sequentially), whether from somatic cells or gametes. The particular tissue chosen will vary depending on the clonal propagation systems available for, and best suited to, the particular species being transformed. Exemplary tissue targets include leaf disks, pollen, embryos, cotyledons, hypocotyls, megagametophytes, callus tissue, existing meristematic tissue (e.g., apical meristems, axillary buds, and root meristems), and induced meristem tissue (e.g., cotyledon meristem and hypocotyl meristem).

Plants of the present invention may take a variety of forms. The plants may be chimeras of transformed cells and non-transformed cells; the plants may be clonal transformants (e.g., all cells transformed to contain the expression cassette); the plants may comprise grafts of transformed and untransformed tissues (e.g., a transformed root stock grafted to an untransformed scion in citrus species). The transformed plants may be propagated by a variety of means, such as by clonal propagation or classical breeding techniques. For example, first generation (or T1) transformed plants may be selfed to give homozygous second generation (or T2) transformed plants, and the T2 plants further propagated through classical breeding techniques. A dominant selectable marker (such as npt II) can be associated with the expression cassette to assist in breeding.

Transformation of plants can be undertaken with a single DNA molecule or multiple DNA molecules (i.e., co-transformation), and both these techniques are suitable for use with the expression cassettes of the present invention. Numerous transformation vectors are available for plant transformation, and the expression cassettes of this invention can be used in conjunction with any such vectors. The selection of vector will depend upon the preferred transformation technique and the target species for transformation.

A variety of techniques are available and known to those skilled in the art for introduction of constructs into a plant cell host. These techniques generally include transformation with DNA employing A. tumefaciens or A. rhizogenes as the transforming agent, liposomes, PEG precipitation, electroporation, DNA injection, direct DNA uptake, microprojectile bombardment, particle acceleration, and the like (See, for example, EP 295959 and EP 138341) (see below). However, cells other than plant cells may be transformed with the expression cassettes of the invention. The general descriptions of plant expression vectors and reporter genes, and Agrobacterium and Agrobacterium-mediated gene transfer, can be found in Gruber et al. (1993).

Expression vectors containing genomic or synthetic fragments can be introduced into protoplasts or into intact tissues or isolated cells. Preferably expression vectors are introduced into intact tissue. General methods of culturing plant tissues are provided for example by Maki et al. (1993) and by Phillips et al. (1988). Preferably, expression vectors are introduced into maize or other plant tissues using a direct gene transfer method such as microprojectile-mediated delivery, DNA injection, electroporation and the like. More preferably expression vectors are introduced into plant tissues using microprojectile-mediated delivery with the biolistic device (see, for example, Tomes et al., 1995). The vectors of the invention can be used not only for expression of structural genes but also in exon-trap cloning or promoter trap procedures to detect differential gene expression in varieties of tissues (Lindsey et al., 1993; Auch & Reth et al.).

It is particularly preferred to use the binary type vectors of Ti and Ri plasmids of Agrobacterium spp. Ti-derived vectors transform a wide variety of higher plants, including monocotyledonous and dicotyledonous plants, such as soybean, cotton, rape, tobacco, and rice (Pacciotti et al., 1985; Byrne et al., 1987; Sukhapinda et al., 1987; Park et al., 1985; Hiei et al., 1994). The use of T-DNA to transform plant cells has received extensive study and is amply described (EP 120516; Hoekema, 1985; Knauf, et al., 1983; and An et al., 1985). For introduction into plants, the chimeric genes of the invention can be inserted into binary vectors as described in the examples.

Other transformation methods are available to those skilled in the art, such as direct uptake of foreign DNA constructs (see EP 295959), techniques of electroporation (Fromm et al., 1986) or high velocity ballistic bombardment with metal particles coated with the nucleic acid constructs (Klein et al., 1987, and U.S. Pat. No. 4,945,050). Once transformed, the cells can be regenerated by those skilled in the art. Of particular relevance are the recently described methods to transform foreign genes into commercially important crops, such as rapeseed (De Block et al., 1989), sunflower (Everett et al., 1987), soybean (McCabe et al., 1988; Hinchee et al., 1988; Chee et al., 1989; Christou et al., 1989; EP 301749), rice (Hiei et al., 1994), and corn (Gordon-Kamm et al., 1990; Fromm et al., 1990).

Those skilled in the art will appreciate that the choice of method might depend on the type of plant, i.e., monocotyledonous or dicotyledonous, targeted for transformation. Suitable methods of transforming plant cells include, but are not limited to, microinjection (Crossway et al., 1986), electroporation (Riggs et al., 1986), Agrobacterium-mediated transformation (Hinchee et al., 1988), direct gene transfer (Paszkowski et al., 1984), and ballistic particle acceleration using devices available from Agracetus, Inc., Madison, Wis. and BioRad, Hercules, Calif. (see, for example, Sanford et al., U.S. Pat. No. 4,945,050; and McCabe et al., 1988). Also see, Weissinger et al., 1988; Sanford et al., 1987 (onion); Christou et al., 1988 (soybean); McCabe et al., 1988 (soybean); Datta et al., 1990 (rice); Klein et al., 1988 (maize); Klein et al., 1988 (maize); Klein et al., 1988 (maize); Fromm et al., 1990 (maize); Gordon-Kamm et al., 1990 (maize); Svab et al., 1990 (tobacco chloroplast); Koziel et al., 1993 (maize); Shimamoto et al., 1989 (rice); Christou et al., 1991 (rice); European Patent Application EP 0 332 581 (orchardgrass and other Pooideae); Vasil et al., 1993 (wheat); Weeks et al., 1993 (wheat). In one embodiment, the protoplast transformation method for maize is employed (European Patent Application EP 0 292 435, U.S. Pat. No. 5,350,689).

In another embodiment, a nucleotide sequence of the present invention is directly transformed into the plastid genome. Plastid transformation technology is extensively described in U.S. Pat. Nos. 5,451,513, 5,545,817, and 5,545,818, in PCT application no. WO 95/16783, and in McBride et al., 1994. The basic technique for chloroplast transformation involves introducing regions of cloned plastid DNA flanking a selectable marker together with the gene of interest into a suitable target tissue, e.g., using biolistics or protoplast transformation (e.g., calcium chloride or PEG mediated transformation). The 1 to 1.5 kb flanking regions, termed targeting sequences, facilitate homologous recombination with the plastid genome and thus allow the replacement or modification of specific regions of the plastome. Initially, point mutations in the chloroplast 16S rRNA and rps12 genes conferring resistance to spectinomycin and/or streptomycin were utilized as selectable markers for transformation (Svab et al., 1990; Staub et al., 1992). This resulted in stable homoplasmic transformants at a frequency of approximately one per 100 bombardments of target leaves. The presence of cloning sites between these markers allowed creation of a plastid targeting vector for introduction of foreign genes (Staub et al., 1993). Substantial increases in transformation frequency are obtained by replacement of the recessive rRNA or r-protein antibiotic resistance genes with a dominant selectable marker, the bacterial aadA gene encoding the spectinomycin-detoxifying enzyme aminoglycoside-3′-adenyltransferase (Svab et al., 1993). Other selectable markers useful for plastid transformation are known in the art and encompassed within the scope of the invention. Typically, approximately 15-20 cell division cycles following transformation are required to reach a homoplastidic state.
Plastid expression, in which genes are inserted by homologous recombination into all of the several thousand copies of the circular plastid genome present in each plant cell, takes advantage of the enormous copy number advantage over nuclear-expressed genes to permit expression levels that can readily exceed 10% of the total soluble plant protein. In a preferred embodiment, a nucleotide sequence of the present invention is inserted into a plastid targeting vector and transformed into the plastid genome of a desired plant host. Plants homoplasmic for plastid genomes containing a nucleotide sequence of the present invention are obtained, and are preferentially capable of high expression of the nucleotide sequence.

Agrobacterium tumefaciens cells containing a vector comprising an expression cassette of the present invention, wherein the vector comprises a Ti plasmid, are useful in methods of making transformed plants. Plant cells are infected with an Agrobacterium tumefaciens as described above to produce a transformed plant cell, and then a plant is regenerated from the transformed plant cell. Numerous Agrobacterium vector systems useful in carrying out the present invention are known.

For example, vectors are available for transformation using Agrobacterium tumefaciens. These typically carry at least one T-DNA border sequence and include vectors such as pBIN19 (Bevan, 1984). In one preferred embodiment, the expression cassettes of the present invention may be inserted into either of the binary vectors pCIB200 and pCIB2001 for use with Agrobacterium. These vector cassettes for Agrobacterium-mediated transformation were constructed in the following manner. pTJS75kan was created by NarI digestion of pTJS75 (Schmidhauser & Helinski, 1985) allowing excision of the tetracycline-resistance gene, followed by insertion of an AccI fragment from pUC4K carrying an NPTII gene (Messing & Vierra, 1982; Bevan et al., 1983; McBride et al., 1990). XhoI linkers were ligated to the EcoRV fragment of pCIB7 which contains the left and right T-DNA borders, a plant selectable nos/nptII chimeric gene and the pUC polylinker (Rothstein et al., 1987), and the XhoI-digested fragment was cloned into SalI-digested pTJS75kan to create pCIB200 (see also EP 0 332 104, example 19). pCIB200 contains the following unique polylinker restriction sites: EcoRI, SstI, KpnI, BglII, XbaI, and SalI. The plasmid pCIB2001 is a derivative of pCIB200 which was created by the insertion into the polylinker of additional restriction sites. Unique restriction sites in the polylinker of pCIB2001 are EcoRI, SstI, KpnI, BglII, XbaI, SalI, MluI, BclI, AvrII, ApaI, HpaI, and StuI. pCIB2001, in addition to containing these unique restriction sites, also has plant and bacterial kanamycin selection, left and right T-DNA borders for Agrobacterium-mediated transformation, the RK2-derived trfA function for mobilization between E. coli and other hosts, and the OriT and OriV functions also from RK2. The pCIB2001 polylinker is suitable for the cloning of plant expression cassettes containing their own regulatory signals.

An additional vector useful for Agrobacterium-mediated transformation is the binary vector pCIB10, which contains a gene encoding kanamycin resistance for selection in plants and T-DNA right and left border sequences, and incorporates sequences from the wide host-range plasmid pRK252 allowing it to replicate in both E. coli and Agrobacterium. Its construction is described by Rothstein et al., 1987. Various derivatives of pCIB10 have been constructed which incorporate the gene for hygromycin B phosphotransferase described by Gritz et al., 1983. These derivatives enable selection of transgenic plant cells on hygromycin only (pCIB743), or hygromycin and kanamycin (pCIB715, pCIB717).

Methods using either a form of direct gene transfer or Agrobacterium-mediated transfer usually, but not necessarily, are undertaken with a selectable marker which may provide resistance to an antibiotic (e.g., kanamycin, hygromycin or methotrexate) or a herbicide (e.g., phosphinothricin). The choice of selectable marker for plant transformation is not, however, critical to the invention.

For certain plant species, different antibiotic or herbicide selection markers may be preferred. Selection markers used routinely in transformation include the nptII gene which confers resistance to kanamycin and related antibiotics (Messing & Vierra, 1982; Bevan et al., 1983), the bar gene which confers resistance to the herbicide phosphinothricin (White et al., 1990, Spencer et al., 1990), the hph gene which confers resistance to the antibiotic hygromycin (Blochinger & Diggelmann), and the dhfr gene, which confers resistance to methotrexate (Bourouis et al., 1983).

One such vector useful for direct gene transfer techniques in combination with selection by the herbicide Basta (or phosphinothricin) is pCIB3064. This vector is based on the plasmid pCIB246, which comprises the CaMV 35S promoter in operational fusion to the E. coli GUS gene and the CaMV 35S transcriptional terminator and is described in the PCT published application WO 93/07278, herein incorporated by reference. One gene useful for conferring resistance to phosphinothricin is the bar gene from Streptomyces viridochromogenes (Thompson et al., 1987). This vector is suitable for the cloning of plant expression cassettes containing their own regulatory signals.

An additional transformation vector is pSOG35, which utilizes the E. coli dihydrofolate reductase (DHFR) gene as a selectable marker conferring resistance to methotrexate. PCR was used to amplify the 35S promoter (about 800 bp), intron 6 from the maize Adh1 gene (about 550 bp) and 18 bp of the GUS untranslated leader sequence from pSOG10. A 250 bp fragment encoding the E. coli dihydrofolate reductase type II gene was also amplified by PCR and these two PCR fragments were assembled with a SacI-PstI fragment from pBI221 (Clontech) which comprised the pUC19 vector backbone and the nopaline synthase terminator. Assembly of these fragments generated pSOG19, which contains the 35S promoter in fusion with the intron 6 sequence, the GUS leader, the DHFR gene and the nopaline synthase terminator. Replacement of the GUS leader in pSOG19 with the leader sequence from Maize Chlorotic Mottle Virus (MCMV) generated the vector pSOG35. pSOG19 and pSOG35 carry the pUC-derived gene for ampicillin resistance and have HindIII, SphI, PstI and EcoRI sites available for the cloning of foreign sequences.

Production and Characterization of Stably Transformed Plants

Transgenic plant cells are then placed in an appropriate selective medium for selection of transgenic cells which are then grown to callus. Shoots are grown from callus and plantlets generated from the shoot by growing in rooting medium. The various constructs normally will be joined to a marker for selection in plant cells. Conveniently, the marker may be resistance to a biocide (particularly an antibiotic, such as kanamycin, G418, bleomycin, hygromycin, chloramphenicol, herbicide, or the like). The particular marker used will allow for selection of transformed cells as compared to cells lacking the DNA which has been introduced. Components of DNA constructs including transcription cassettes of this invention may be prepared from sequences which are native (endogenous) or foreign (exogenous) to the host. By “foreign” it is meant that the sequence is not found in the wild-type host into which the construct is introduced. Heterologous constructs will contain at least one region which is not native to the gene from which the transcription-initiation-region is derived.

To confirm the presence of the transgenes in transgenic cells and plants, a variety of assays may be performed. Such assays include, for example, “molecular biological” assays well known to those of skill in the art, such as Southern and Northern blotting, in situ hybridization and nucleic acid-based amplification methods such as PCR or RT-PCR; “biochemical” assays, such as detecting the presence of a protein product, e.g., by immunological means (ELISAs and Western blots) or by enzymatic function; plant part assays, such as leaf or root assays; and also, by analyzing the phenotype of the whole regenerated plant, e.g., for disease or pest resistance.

DNA may be isolated from cell lines or any plant parts to determine the presence of the preselected nucleic acid segment through the use of techniques well known to those skilled in the art. Note that intact sequences will not always be present, presumably due to rearrangement or deletion of sequences in the cell.

The presence of nucleic acid elements introduced through the methods of this invention may be determined by polymerase chain reaction (PCR). Using this technique, discrete fragments of nucleic acid are amplified and detected by gel electrophoresis. This type of analysis permits one to determine whether a preselected nucleic acid segment is present in a stable transformant, but does not prove integration of the introduced preselected nucleic acid segment into the host cell genome. In addition, it is not possible using PCR techniques to determine whether transformants have exogenous genes introduced into different sites in the genome, i.e., whether transformants are of independent origin. It is contemplated that using PCR techniques it would be possible to clone fragments of the host genomic DNA adjacent to an introduced preselected DNA segment.

Positive proof of DNA integration into the host genome and the independent identities of transformants may be determined using the technique of Southern hybridization. Using this technique specific DNA sequences that were introduced into the host genome and flanking host DNA sequences can be identified. Hence the Southern hybridization pattern of a given transformant serves as an identifying characteristic of that transformant. In addition it is possible through Southern hybridization to demonstrate the presence of introduced preselected DNA segments in high molecular weight DNA, i.e., confirm that the introduced preselected DNA segment has been integrated into the host cell genome. The technique of Southern hybridization provides information that is obtained using PCR, e.g., the presence of a preselected DNA segment, but also demonstrates integration into the genome and characterizes each individual transformant.

It is contemplated that using the techniques of dot or slot blot hybridization which are modifications of Southern hybridization techniques one could obtain the same information that is derived from PCR, e.g., the presence of a preselected DNA segment.

Both PCR and Southern hybridization techniques can be used to demonstrate transmission of a preselected DNA segment to progeny. In most instances the characteristic Southern hybridization pattern for a given transformant will segregate in progeny as one or more Mendelian genes (Spencer et al., 1992; Laursen et al., 1994), indicating stable inheritance of the gene. The nonchimeric nature of the callus and the parental transformants (R₀) was suggested by germline transmission and the identical Southern blot hybridization patterns and intensities of the transforming DNA in callus, R₀ plants and R₁ progeny that segregated for the transformed gene.

Whereas DNA analysis techniques may be conducted using DNA isolated from any part of a plant, RNA may only be expressed in particular cells or tissue types and hence it will be necessary to prepare RNA for analysis from these tissues. PCR techniques may also be used for detection and quantitation of RNA produced from introduced preselected DNA segments. In this application of PCR it is first necessary to reverse transcribe RNA into DNA, using enzymes such as reverse transcriptase, and then through the use of conventional PCR techniques amplify the DNA. In most instances PCR techniques, while useful, will not demonstrate integrity of the RNA product. Further information about the nature of the RNA product may be obtained by Northern blotting. This technique will demonstrate the presence of an RNA species and give information about the integrity of that RNA. The presence or absence of an RNA species can also be determined using dot or slot blot Northern hybridizations. These techniques are modifications of Northern blotting and will only demonstrate the presence or absence of an RNA species.

While Southern blotting and PCR may be used to detect the preselected DNA segment in question, they do not provide information as to whether the preselected DNA segment is being expressed. Expression may be evaluated by specifically identifying the protein products of the introduced preselected DNA segments or evaluating the phenotypic changes brought about by their expression.

Assays for the production and identification of specific proteins may make use of physical-chemical, structural, functional, or other properties of the proteins. Unique physical-chemical or structural properties allow the proteins to be separated and identified by electrophoretic procedures, such as native or denaturing gel electrophoresis or isoelectric focusing, or by chromatographic techniques such as ion exchange or gel exclusion chromatography. The unique structures of individual proteins offer opportunities for use of specific antibodies to detect their presence in formats such as an ELISA assay. Combinations of approaches may be employed with even greater specificity such as Western blotting in which antibodies are used to locate individual gene products that have been separated by electrophoretic techniques. Additional techniques may be employed to absolutely confirm the identity of the product of interest such as evaluation by amino acid sequencing following purification. Although these are among the most commonly employed, other procedures may be additionally used.

Assay procedures may also be used to identify the expression of proteins by their functionality, especially the ability of enzymes to catalyze specific chemical reactions involving specific substrates and products. These reactions may be followed by providing and quantifying the loss of substrates or the generation of products of the reactions by physical or chemical procedures. Examples are as varied as the enzyme to be analyzed.

Very frequently the expression of a gene product is determined by evaluating the phenotypic results of its expression. These assays also may take many forms including but not limited to analyzing changes in the chemical composition, morphology, or physiological properties of the plant. Morphological changes may include greater stature or thicker stalks. Most often changes in response of plants or plant parts to imposed treatments are evaluated under carefully controlled conditions termed bioassays.

It is understood that the embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this specification. All publications, patents, and patent applications cited herein are hereby incorporated by reference for all purposes to the same extent as if each individual publication, patent or patent application were specifically and individually indicated to be so incorporated by reference. The following example is provided to illustrate, but not to limit, the invention.

EXAMPLE

Identification of CNSs

DNA sequences were processed using Perl scripts to automate CNS discovery. Coding regions of maize and rice genes were identified by comparing genomic sequences to predicted protein translations using NCBI's blastx (Altschul, S. F. et al. (1997) Nucleic Acids Research 25, 3389-3402). The coding regions were masked out for further comparisons. The two masked genomic sequences were then compared using NCBI's bl2seq (Tatusova, T. A. & Madden, T. L. (1999) FEMS Microbiology Letters 174, 247-250), with the following parameters: for finding stringent CNSs (FIG. 1A and Table 2), Word size=7, Gap Penalties: Existence=5, Extension=2; for finding long CNSs (FIG. 1B), Word size=7, Gap Penalties: Existence=2, Extension=1. The positions of CNSs relative to intron/exon structure were visualized using Perl scripts to parse the output of both the blastx and bl2seq results.
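By way of illustration only, the mask-then-compare workflow described above can be sketched in Python. This is not the Perl, blastx, or bl2seq pipeline itself: it is a simplified stand-in that searches the plus strand for ungapped exact matches only, and the function names, the mask character, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict


def mask_regions(seq, regions, mask_char="N"):
    """Replace each 1-based, inclusive (start, end) interval with mask_char."""
    chars = list(seq)
    for start, end in regions:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def shared_exact_matches(a, b, min_len=15, mask_char="N"):
    """Return maximal exact matches as (start_a, start_b, length), 0-based.

    Only plus-strand, ungapped matches of at least min_len unmasked bases
    are reported (a simplification relative to a BLAST comparison).
    """
    # Index every unmasked min_len-mer of b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - min_len + 1):
        kmer = b[j:j + min_len]
        if mask_char not in kmer:
            index[kmer].append(j)

    matches = []
    for i in range(len(a) - min_len + 1):
        for j in index.get(a[i:i + min_len], ()):
            # Skip seeds that only continue a match that started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1] and a[i - 1] != mask_char:
                continue
            length = min_len
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]
                   and a[i + length] != mask_char):
                length += 1
            matches.append((i, j, length))
    return matches


# Toy usage: two short sequences that share the 25 bp CNS1 sequence of Table 1.
cns1 = "TCAGCTGCAAAACTCCAAGCTGTCT"
seq_a = "GGGGG" + cns1 + "ACACA"
seq_b = "TTT" + cns1 + "CCGTG"
print(shared_exact_matches(seq_a, seq_b))
```

The 15 bp default mirrors the exact-match cut off used in the example; a gapped comparison with the stated word size and gap penalties would require an alignment tool such as the one named in the text.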

Sequence Information

The Genbank accession number for the rice lg1 containing BAC is AL442117. 7,142 bp of maize lg1 was sequenced from a genomic DNA clone (AF451895). The Genbank accession numbers for lg1 intron 1 from the other grasses are: Setaria (AF451892), Arundo (AF451893) and bamboo (AF451894). The Genbank accession numbers for the additional seven maize and rice genes used to obtain the data of Table 2 are: adh1 maize (X04049) and rice (AF172282); chalcone synthase maize (X60204.1) and rice (AB058397.1); maize lg3, missing intron 3 (AF479591, AF479592) and rice osh6 (AP002881); sh2 maize (M81603) and rice (AF101045); wx1 maize (X03935) and rice (AF141955); H+ ATPase maize (AF480431) and rice (AL442117). The mouse BAC is AC002397. The human region is 12p13, U47924.

PCR and Southern Blots

The lg1 CNS3: exon 1 PCR products were amplified using primer pair CNS3-2 (gctatcaccmctcaccaactccaattaa; m=a or c (SEQ ID NO: 2)) and lg1-77 (cgctggagaggtcggcctt (SEQ ID NO: 3)) with 94° C. 30 sec, 60° C. 30 sec, 72° C. 1 min cycling conditions for 35 cycles. The PCR products were hybridized with a combined maize and rice lg1 exon 1 derived probe that did not include the highly conserved SBP domain. The maize exon 1 probe was amplified from a maize lg1 cDNA clone using the following primers: ccgaattcatggagcagcagcaggagagc (SEQ ID NO: 4) and tggccgtacgtgtagccctctggg (SEQ ID NO: 5); the rice exon 1 probe was amplified from Nipponbare genomic DNA using primers: atgatgaacgttccatccgcc (SEQ ID NO: 6) and tcggcgggaagtaggtccggta (SEQ ID NO: 7). Both amplification conditions were the same as used to amplify the CNS3-2: lg1-77 PCR products. Hybridization and washes were done at 65° C., and the blot was washed in 0.2×SSC/0.2% SDS. lg1 intron 1 from various grasses was PCR amplified in two overlapping fragments using an exon 1 derived primer (lg1-77 Rev, aaggccgacctctccagcg (SEQ ID NO: 8)) with a CNS 1 primer (tcagctgcaaaactccaagctgtct (SEQ ID NO: 9)), and a CNS 4 primer (gatcaagactttgtacagcc (SEQ ID NO: 10)) with an exon 2 derived primer (tcttrgcatcgtcgaactcatc; r=a or g (SEQ ID NO: 11)). Conditions for lg1-77 Rev: CNS 1-2R: 97° C. 2 min, 30 cycles of 94° C. 30 sec, 62° C. 1 min, 72° C. 2 min. Conditions for CNS 4-lg1-79: 97° C. 2 min, 30 cycles of 94° C. 30 sec, 54° C. 30 sec, 72° C. 1 min. The PCR products were cloned using Invitrogen's pTOPO cloning system and sequenced at the UC-Berkeley DNA sequencing facility. The fragments were assembled using DNASTAR Seqman II software. When necessary, strand specific primers were used to fill in any gaps in the sequences.
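As an illustrative sketch only, the degenerate positions in the primers above (m=a or c; r=a or g) can be expanded into the individual oligonucleotides they represent, and the relationship between the lg1-77 and lg1-77 Rev primers can be checked by reverse complementation. The function names are hypothetical, and only the two ambiguity codes used in the text are handled.

```python
from itertools import product

# Only the ambiguity codes that appear in the primers of this example.
AMBIGUITY = {"a": "a", "c": "c", "g": "g", "t": "t", "m": "ac", "r": "ag"}

# Complement table; the complement of m (a/c) is k (t/g), of r (a/g) is y (t/c).
COMPLEMENT = str.maketrans("acgtmr", "tgcaky")


def expand_degenerate(primer):
    """List every unambiguous sequence encoded by a degenerate primer."""
    choices = (AMBIGUITY[base] for base in primer.lower())
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Return the reverse complement of a lower-case DNA sequence."""
    return seq.lower().translate(COMPLEMENT)[::-1]


cns3_2 = "gctatcaccmctcaccaactccaattaa"   # SEQ ID NO: 2
lg1_77 = "cgctggagaggtcggcctt"            # SEQ ID NO: 3
lg1_77_rev = "aaggccgacctctccagcg"        # SEQ ID NO: 8

for oligo in expand_degenerate(cns3_2):
    print(oligo)
print(reverse_complement(lg1_77) == lg1_77_rev)
```

The final comparison confirms that the exon 1 primer used for the intron 1 amplification is the reverse complement of the primer used with CNS3-2.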

CNSs Exist in liguleless1

liguleless1 (lg1) encodes a member of the squamosa promoter binding protein (SBP) family. It is required to specify the ligule, an epidermal organ that exists along a line that bisects the grass leaf into sheath and blade (24-27). Because grasses share the homologous ligule structure, we reasoned that lg1 gene regulation may also be evolutionarily conserved. To assess the conserved regulatory sequences we identified an unannotated rice bacterial artificial chromosome (BAC) that contains the rice lg1 ortholog. This sequence maps to rice chromosome 4L (data not shown), which is syntenic with the region of maize chromosome 2S where lg1 resides (26). Within the coding region a comparison of the predicted maize and rice proteins shows them to be 70% similar in amino acid sequence overall and identical in the SBP domain (data not shown). We identified CNSs based on three criteria: conservation of sequence, conservation of position relative to the intron/exon structure of the gene, and a size greater than or equal to 15 basepairs (bp). For reference, any given 15 bp sequence occurs by chance with a probability of (1/4)¹⁵ per position and is therefore expected to exist about once in the entire ~400 megabasepair rice genome. Therefore, we utilized 15 identical bp conserved between maize and rice (i.e., 15/15) as our cut off, but we recognize that shorter sequences are also important.
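The chance-occurrence estimate behind the 15 bp cut off can be worked through as follows. This is a minimal sketch that assumes equal base frequencies and independent positions, neither of which holds exactly for a real genome; the function name and the both-strands option are assumptions made for illustration.

```python
def expected_occurrences(k, genome_bp, both_strands=False):
    """Expected number of chance occurrences of one specific k-mer.

    Assumes each base is drawn independently with probability 1/4.
    """
    per_strand = genome_bp * (0.25 ** k)
    return 2 * per_strand if both_strands else per_strand


rice_genome_bp = 400_000_000  # approximately 400 megabasepairs, as stated above

print(round(expected_occurrences(15, rice_genome_bp), 2))
print(round(expected_occurrences(15, rice_genome_bp, both_strands=True), 2))
```

Under these assumptions the expectation is a little under one occurrence per genome whether one or both strands are counted, which is the sense in which a 15 bp exact match is expected "about once".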

A comparison of 7142 bp of maize genomic DNA sequence containing the three exons of lg1 (and about 5.9 kb of noncoding sequence) and 7100 bp of rice genomic sequence identified 7 CNSs that met our stringent criteria (see methods). The same CNSs were identified when the entire rice BAC sequence (~50 kb) was used in the comparison (data not shown). FIG. 1A shows lg1 “stringent CNSs” listed as CNSs 1-7 in the order of their significance. These seven CNSs identified in FIG. 1A are too small to fit the definition used by most researchers working with mammalian systems, >100 bp in length and >70% identity (11). However, if we relax the gap existence and extension penalties to those used by Levy and coworkers (15), we display the lg1 CNSs in a different way (FIG. 1B). Here, we find two long CNSs, CNS1′ (97/124) and CNS2′ (84/107), in intron 1 replacing the CNS4-CNS5-CNS1-CNS6 region of FIG. 1A. Both of these less stringent CNSs meet the >100 bp, >70% identity criteria often used in mammalian studies.
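For illustration, the >100 bp, >70% identity test mentioned above can be applied to the identity ratios reported for CNS1′ and CNS2′. The function names and default thresholds below are assumptions that simply restate the criteria quoted in the text.

```python
def percent_identity(identical, aligned):
    """Percent identity of an alignment given identical and aligned lengths."""
    return 100.0 * identical / aligned


def meets_mammalian_criteria(identical, aligned, min_len=100, min_identity=70.0):
    """True when an alignment is longer than min_len and above min_identity."""
    return aligned > min_len and percent_identity(identical, aligned) > min_identity


# Identity ratios as reported in the text (identical bases / aligned length).
long_cnss = {"CNS1'": (97, 124), "CNS2'": (84, 107)}

for name, (identical, aligned) in long_cnss.items():
    print(name, round(percent_identity(identical, aligned), 1),
          meets_mammalian_criteria(identical, aligned))
```

Both long CNSs come out at roughly 78% identity over more than 100 bp, whereas a 25/25 stringent CNS fails the length requirement despite being 100% identical.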

lg1 CNSs are Conserved Among Many Grasses

To test whether the CNSs identified between maize and rice are conserved in other grasses we devised a polymerase chain reaction (PCR) based assay. CNS3 (FIG. 1A) was used to design a PCR primer (CNS3-2) facing downstream toward exon 1. A second primer (lg1-77) was chosen within exon 1 directed upstream toward the promoter and CNS3. In addition to maize and rice, this primer pair amplified a fragment from five more grass species (FIG. 2). In all, seven tribes and five different subfamilies (panicoids, chloridoids, bambusoids, oryzoids, and arundinoids) are represented (28, 29). FIG. 2A shows an ethidium bromide stained gel with one predominant amplification product from each species. FIG. 2B shows a Southern blot of this gel hybridized with a combined maize and rice lg1 specific exon 1 probe. In each case the predominant PCR product is homologous to the lg1 exon probe. Note the lg1 hybridizing sequences are polymorphic in length between the various grasses, which is expected because they contain noncoding sequences.

To further examine the conservation of CNSs in lg1 we PCR amplified intron 1 from three other grasses, a bamboo, a Setaria, and an Arundo, and compared them with maize and rice using ClustalX to generate a five-way sequence alignment (30). These five sequences represent five tribes and four subfamilies of grasses. The results are displayed in Table 1. Full intron sequences are posted online at the PNAS web site. All five intron 1 stringent CNSs are highly conserved among all these grasses in both sequence and position. Note that when the bamboo, Setaria and Arundo sequences were added to the original rice and maize comparison, the CNSs became more refined.

CNSs Exist in Other Grass Genes

To establish that the results we obtained at lg1 were general properties of grass genes, we searched for stringent CNSs in seven additional maize/rice gene pairs. The genes we have chosen have unambiguous, experimental exon annotation. Table 2 summarizes the number of stringent CNSs found, sorted by length; six of these genes contained one or more CNSs with a significance greater than or equal to a 15 bp exact match. None of these sequences contained stringent CNSs greater than 60 bp in length. Researchers working in mammalian systems have identified larger CNSs (11, 15), suggesting that mammalian and higher plant gene CNSs may, on average, differ. In order to compare our results directly to mammalian analyses, we used our stringent algorithm to identify the CNSs in a previously annotated 52 kb region of man and mouse genomic DNA containing six genes (31). Our analysis is in complete concordance with that of other mammalian CNS researchers as we identified twelve large (>100 bp) CNSs and 94 smaller CNSs in these mammalian genes (Table 2).

TABLE 1
lg1 intron 1 CNSs are conserved in divergent grasses. Intron 1 sequences from Arundo donax, Setaria viridis, a bamboo, maize and rice were aligned using ClustalX 1.81. CNSs 4, 5, 1, 6 and 7 are listed. Asterisks represent sequence identity among all five species.

Species    CNS4                SEQ ID NO  CNS5                                           SEQ ID NO
maize      CAA-GACTTTGTACAGCC  12         ACGTACGTACGTACGTACGAGCAAAAA-TTGCAAATGCAAGCAAC  17
rice       CAAAGACTTTGTACAGCC  13         ACGTACATATGTACGTACCAGCACAAG-TTGCATTTGCAAGCAAC  18
Setaria    CAA-GACTTTGTACAGCC  14         ---CGTGCACGTATGTACGAGCAAAAAATTCCAATTGCAAGCAAC  19
Arundo     CAA-GACTTTGTACAGCC  15         ---CATGTACGTACGTACCAGCAAAAG-TTGCAATTGCGAGCGAC  20
bamboo     CAA-GACTTTGTACAGCC  16         --------ACGTACGTACCAGCACAAG-TTGCGATTGCAAGCAAC  21
Consensus  *** **************                     * *** **** **** **  ** *   *** *** **

Species    CNS1                       SEQ ID NO  CNS6                 SEQ ID NO  CNS7             SEQ ID NO
maize      TCAGCTGCAAAACTCCAAGCTGTCT  22         ATTGGTCCCCGGAATGTTC  27         GTATTTCATGTGTAT  32
rice       TCAGCTGCAAAACTCCAAGCTGTCT  23         ATTGGTCCCCAGAATGTTC  28         GTATTTCATGTGTAT  33
Setaria    TTAGCTGCAAGACTCCAAGCTGTCT  24         GTTGGTCCC-GGAATGTTC  29         GTCTTTGCTGTTAAT  34
Arundo     TCAGCTGCAAGAATCCAAGCTGTCT  25         ATTGGTCCCCGGAATGTTC  30         GTATTTCATGTGTAT  35
bamboo     TCAGCTGCAAAACTCCAAGCTGTCT  26         ATTGGTCCCCAGAATGTTC  31         GTATTTCATGTGTAT  36
Consensus  * ******** * ************              ********  ********             ** ***  ***  **
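The consensus rows of Table 1 mark alignment columns that are identical in all five species. A minimal sketch of how such a row can be derived from pre-aligned sequences is given below; the function name is hypothetical, and the alignment itself is taken as given rather than computed.

```python
def consensus_line(aligned):
    """Return a '*'/' ' string marking columns identical in every sequence.

    Sequences must already be aligned to equal length; gap columns ('-')
    are never marked as conserved.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*aligned)
    )


# CNS4 and CNS7 rows of Table 1 (maize, rice, Setaria, Arundo, bamboo).
cns4 = ["CAA-GACTTTGTACAGCC", "CAAAGACTTTGTACAGCC", "CAA-GACTTTGTACAGCC",
        "CAA-GACTTTGTACAGCC", "CAA-GACTTTGTACAGCC"]
cns7 = ["GTATTTCATGTGTAT", "GTATTTCATGTGTAT", "GTCTTTGCTGTTAAT",
        "GTATTTCATGTGTAT", "GTATTTCATGTGTAT"]

print(repr(consensus_line(cns4)))
print(repr(consensus_line(cns7)))
```

Counting the asterisks in each consensus string gives the number of positions that are invariant across all five grasses for that CNS.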

TABLE 2
Summary of CNS frequencies and size-distributions in seven rice/maize gene pairs. Maize and rice genomic DNA sequences were analyzed for the presence of CNSs. The genes are listed with the frequency of CNSs of the various size classes indicated. The last column lists the size of the smaller of the two sequences being compared.

Gene name                15-20 bp  21-30 bp  31-60 bp  61-99 bp  >100 bp  Size of compared region (bp)
lg1                      4         1         2         0         0        7142
lg3/osh6                 1         0         1         0         0        2300
chalcone synthase        1         0         0         0         0        4000
wx1                      0         0         0         0         0        4800
adh1                     2         0         0         0         0        6000
sh2                      0         1         0         0         0        7320
H+ ATPase                1         0         0         0         0        6725
6 gene mammalian region  26        29        29        10        12       51187
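As an illustrative check of the Table 2 tallies, the counts can be totaled per row and normalized by the size of the compared region. The dictionary below is transcribed from Table 2; the per-kilobase density is an added convenience measure and not a statistic reported in this specification.

```python
# Size classes, in order: 15-20, 21-30, 31-60, 61-99, >100 bp.
TABLE2 = {
    "lg1": ((4, 1, 2, 0, 0), 7142),
    "lg3/osh6": ((1, 0, 1, 0, 0), 2300),
    "chalcone synthase": ((1, 0, 0, 0, 0), 4000),
    "wx1": ((0, 0, 0, 0, 0), 4800),
    "adh1": ((2, 0, 0, 0, 0), 6000),
    "sh2": ((0, 1, 0, 0, 0), 7320),
    "H+ ATPase": ((1, 0, 0, 0, 0), 6725),
    "6 gene mammalian region": ((26, 29, 29, 10, 12), 51187),
}


def total_cns(counts):
    """Total number of CNSs across all size classes."""
    return sum(counts)


def cns_per_kb(counts, region_bp):
    """CNSs per kilobase of compared sequence (illustrative measure only)."""
    return 1000.0 * sum(counts) / region_bp


for gene, (counts, region_bp) in TABLE2.items():
    print(gene, total_cns(counts), round(cns_per_kb(counts, region_bp), 2))
```

The totals reproduce the figures quoted in the text: seven stringent CNSs at lg1, and twelve large plus 94 smaller CNSs in the mammalian region.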

Using cross species genomic DNA sequence comparisons, we have identified CNSs in several grass genes. Comparison of maize and rice genomic sequences at seven loci identified CNSs in six genes. We extended this finding by showing that the lg1 CNSs are highly conserved among all grasses analyzed. These potential regulatory elements will help to elucidate networks of gene regulation.

The question of what constitutes a functional CNS unit becomes particularly interesting since mammals certainly have more long CNSs than plants (Table 2). This raises the question of whether a long CNS functions as a unit, is a collection of shorter functional units, or both. One hypothetical function for long CNSs is as a template for the assembly of regulatory protein complexes that may not associate on their own. Our data suggest that regulatory complexity differences between mammals and grasses might be reflected in the complexity of the genes themselves. Man and mouse are estimated to have diverged from a common ancestor living in the late Cretaceous period (65-98.9 MYA) (33), and the median synonymous (neutral) substitution frequency across many rodent/human genes is 4×10⁻⁹ mutations/yr (34). Thus, mammalian and grass CNSs may be compared with one another.
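To give a sense of scale for the divergence figures quoted above, the following sketch estimates the expected number of neutral substitutions per site separating two lineages. It rests on assumptions that are not stated in this specification: a constant rate (a simple molecular clock), a rate expressed per site per year per lineage, and no correction for multiple substitutions at the same site. The function name is hypothetical.

```python
def expected_neutral_divergence(rate_per_site_per_year, divergence_years):
    """Expected substitutions per site between two lineages.

    Assumes a constant per-lineage rate, so both branches contribute:
    K = 2 * rate * time. Multiple hits at one site are not corrected for.
    """
    return 2.0 * rate_per_site_per_year * divergence_years


rate = 4e-9                      # mutations per year, as quoted in the text
for mya in (65.0, 98.9):         # bounds of the quoted divergence estimate
    k = expected_neutral_divergence(rate, mya * 1e6)
    print(mya, round(k, 2))
```

Under these assumptions the quoted range corresponds to roughly 0.5 to 0.8 expected substitutions per neutral site, which illustrates why sequence that remains identical over such a period is unlikely to be evolving neutrally.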

Comparative gene mapping in grasses has been done using conserved exons as hybridization probes (4). Current gene-specific PCR-based mapping tools, like SSR primers, are useful within one or a few closely related species. CNS-derived mapping tools should work for all species within a family and perhaps related families. With or without exon sequence, CNSs could be used to prepare PCR-based gene specific amplification products that are designed to detect polymorphisms. We successfully used primers complementary to CNS 1 and 4 along with exon derived primers to amplify lg1 intron 1, demonstrating that CNS 1 and 4 are useful pan-grass primer sites. The other CNS primer we tested, CNS3, also specifically amplified lg1 from all of the grasses tested (FIG. 2). The identification of CNSs in many orthologous gene pairs should quickly lead to the development of multiple PCR markers and high density linkage maps.

No lg1 CNSs exist intact anywhere in the rice genome outside of lg1 itself; this conclusion results from conducting short sequence searches of the complete TMRI Nipponbare Japonica rice genome. However, we were able to detect partial identity to lg1 stringent CNSs in rice and other organisms. For example, CNS1 (25/25 bp) exists as 17/17 bp in a rice gene (AF488772) similar to a hypothetical Arabidopsis gene. However, as we lack the sequence of a maize ortholog of this rice gene, we cannot assess whether this 17 bp sequence is itself a CNS, and therefore meaningful. As argued convincingly by Hardison and coworkers, a sequenced mouse genome is necessary to annotate the human genome (35). Without an annotation sequence, finding human exons has proven problematic (36). With the rice genome now completed, this same scenario is now true in the grasses (37). A second sequenced grass genome is needed for identification and annotation of conserved sequences, including coding and regulatory components of genes. As another example, of the seven lg1 CNSs, only one is obviously over-represented in Genbank. A component of CNS5, acgtacgtacgtacgtacga (Table 1), is found in 12 of the 17 top hits (19/19 bp) in different genes of various grasses. The other five top hits were animal, and surprisingly, none were from Arabidopsis. We suspected a grass family transposon sequence, but there were no transposon terminal repeats or insertion site duplications nearby. With a second fully sequenced grass genome, testing the hypothesis that this sequence is conserved and thus biologically important would be possible.
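A short sequence search of the kind referred to above can be sketched as an exact-match scan of both strands. This is an assumption-laden illustration, not the search tool actually used: it requires a perfect match, holds the whole sequence in memory, and the function names and the toy "genome" are hypothetical.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (lower case)."""
    return seq.lower().translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Return every 0-based start position of pattern, overlaps included."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def find_both_strands(genome, query):
    """Return sorted (position, strand) pairs for exact matches of query.

    Positions are 0-based on the plus strand; '-' marks a match of the
    reverse complement of the query.
    """
    genome, query = genome.lower(), query.lower()
    hits = [(pos, "+") for pos in find_all(genome, query)]
    hits += [(pos, "-") for pos in find_all(genome, reverse_complement(query))]
    return sorted(hits)


# Toy sequence containing the CNS 4 primer sequence once on each strand.
query = "gatcaagactttgtacagcc"
toy_genome = "aaaa" + query + "tttt" + reverse_complement(query) + "cc"
print(find_both_strands(toy_genome, query))
```

A search that also reports partial identities, such as the 17/17 bp match to CNS1 noted above, would need an alignment-based tool rather than this exact-match scan.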

CNSs Exist in Multiple Genes

Table 3A presents comparative sequence data for the maize and rice adh genes. Table 3B presents comparative sequence data for the maize and rice Cs genes. Table 3C presents comparative sequence data for the maize and rice sh2 genes. These results demonstrate that CNSs exist at three different genes.

TABLE 3A
DNA files adh1.maize (Query) and rice.adh1.small (Subject), protein file adh1.prot.maize:

Query = (5904 letters)
>rice adh1.small
Length = 7000

Score = 34.2 bits (17), Expect = 0.002 #1
Identities = 17/17 (100%) Strand = Plus/Plus
Query: 816  gcgcatccgacggccac 832 (SEQ ID NO: 37)
            |||||||||||||||||
Sbjct: 1505 gcgcatccgacggccac 1521

Score = 30.2 bits (15), Expect = 0.033 #2
Identities = 15/15 (100%) Strand = Plus/Plus
Query: 1966 ggttgacttgcgcct 1980 (SEQ ID NO: 38)
            |||||||||||||||
Sbjct: 2354 ggttgacttgcgcct 2368

Score = 28.2 bits (14), Expect = 0.13 #3
Identities = 14/14 (100%) Strand = Plus/Minus
Query: 2086 ctgactatagatgt 2099 (SEQ ID NO: 39)
            ||||||||||||||
Sbjct: 6000 ctgactatagatgt 5987

Score = 28.2 bits (14), Expect = 0.13 #4
Identities = 14/14 (100%) Strand = Plus/Plus
Query: 461  tttattaagctaat 474 (SEQ ID NO: 40)
            ||||||||||||||
Sbjct: 5339 tttattaagctaat 5352

Score = 28.2 bits (14), Expect = 0.13 #5
Identities = 14/14 (100%) Strand = Plus/Plus
Query: 2844 ttgagatgctgagt 2857 (SEQ ID NO: 41)
            ||||||||||||||
Sbjct: 3649 ttgagatgctgagt 3662

Score = 26.3 bits (13), Expect = 0.51 #6
Identities = 16/17 (94%) Strand = Plus/Plus
Query: 786  gctccgagccgcagatc 802 (SEQ ID NO: 42)
            ||||| |||||||||||
Sbjct: 1476 gctcccagccgcagatc 1492 (SEQ ID NO: 43)

Score = 26.3 bits (13), Expect = 0.51
Identities = 13/13 (100%) Strand = Plus/Plus
Query: 5094 ttattttattttt 5106 (SEQ ID NO: 44)
            |||||||||||||
Sbjct: 248  ttattttattttt 260

Score = 24.3 bits (12), Expect = 2.0
Identities = 12/12 (100%) Strand = Plus/Plus
Query: 5465 taaattgtattt 5476 (SEQ ID NO: 45)
            ||||||||||||
Sbjct: 351  taaattgtattt 362

Score = 24.3 bits (12), Expect = 2.0
Identities = 12/12 (100%) Strand = Plus/Minus
Query: 2312 tatgattctttc 2323 (SEQ ID NO: 46)
            ||||||||||||
Sbjct: 2877 tatgattctttc 2866

Score = 24.3 bits (12), Expect = 2.0
Identities = 15/16 (93%) Strand = Plus/Plus
Query: 3415 ttagtcattttattac 3430 (SEQ ID NO: 47)
            |||| |||||||||||
Sbjct: 1966 ttaggcattttattac 1981 (SEQ ID NO: 48)

Score = 24.3 bits (12), Expect = 2.0
Identities = 12/12 (100%) Strand = Plus/Plus
Query: 3213 cattttattact 3224 (SEQ ID NO: 49)
            ||||||||||||
Sbjct: 1971 cattttattact 1982

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 2246 tattatattat 2256 (SEQ ID NO: 50)
            |||||||||||
Sbjct: 775  tattatattat 765

Score = 22.3 bits (11), Expect = 8.0
Identities = 14/15 (93%) Strand = Plus/Plus
Query: 3001 tcaattgcaatgaga 3015 (SEQ ID NO: 51)
            |||||| ||||||||
Sbjct: 3793 tcaattacaatgaga 3807 (SEQ ID NO: 52)

Score = 22.3 bits (11), Expect = 8.0
Identities = 14/15 (93%) Strand = Plus/Minus
Query: 880  cccaccagtccacca 894 (SEQ ID NO: 53)
            ||||||| |||||||
Sbjct: 2137 cccaccaatccacca 2123 (SEQ ID NO: 54)

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 4437 ctgtatcagta 4447 (SEQ ID NO: 55)
            |||||||||||
Sbjct: 4123 ctgtatcagta 4113

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 5465 taaattgtatt 5475 (SEQ ID NO: 56)
            |||||||||||
Sbjct: 504  taaattgtatt 514

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 2124 attatgttttt 2134 (SEQ ID NO: 57)
            |||||||||||
Sbjct: 2111 attatgttttt 2121

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 468  agctaatttta 478 (SEQ ID NO: 58)
            |||||||||||
Sbjct: 38   agctaatttta 48

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 5096 attttattttt 5106 (SEQ ID NO: 59)
            |||||||||||
Sbjct: 225  attttattttt 235

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 4986 gggctgttctg 4996 (SEQ ID NO: 60)
            |||||||||||
Sbjct: 5143 gggctgttctg 5153

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 1681 ttcgaatttta 1691 (SEQ ID NO: 61)
            |||||||||||
Sbjct: 112  ttcgaatttta 122

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 5465 taaattgtatt 5475 (SEQ ID NO: 62)
            |||||||||||
Sbjct: 47   taaattgtatt 57

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 470  ctaattttaga 480 (SEQ ID NO: 63)
            |||||||||||
Sbjct: 388  ctaattttaga 398

Score = 22.3 bits (11), Expect = 8.0
Identities = 14/15 (93%) Strand = Plus/Plus
Query: 4533 gagaaggaaaagaag 4547 (SEQ ID NO: 64)
            |||||||| ||||||
Sbjct: 6775 gagaaggagaagaag 6789 (SEQ ID NO: 65)

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 4404 tacaaattcca 4414 (SEQ ID NO: 66)
            |||||||||||
Sbjct: 5934 tacaaattcca 5944

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 983  cgtggtttgct 993 (SEQ ID NO: 67)
            |||||||||||
Sbjct: 1623 cgtggtttgct 1633

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 365  tttaaaaatag 375 (SEQ ID NO: 68)
            |||||||||||
Sbjct: 507  tttaaaaatag 497

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 2246 tattatattat 2256 (SEQ ID NO: 69)
            |||||||||||
Sbjct: 995  tattatattat 985

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 365  tttaaaaatag 375 (SEQ ID NO: 70)
            |||||||||||
Sbjct: 354  tttaaaaatag 344

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 381  gaatatgttag 391 (SEQ ID NO: 71)
            |||||||||||
Sbjct: 3158 gaatatgttag 3148

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 3317 cattttattac 3327 (SEQ ID NO: 72)
            |||||||||||
Sbjct: 1971 cattttattac 1981

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 944  tccaggtggag 954 (SEQ ID NO: 73)
            |||||||||||
Sbjct: 1569 tccaggtggag 1579

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 470  ctaattttaga 480 (SEQ ID NO: 74)
            |||||||||||
Sbjct: 236  ctaattttaga 246

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Minus
Query: 175  ttttagttagg 185 (SEQ ID NO: 75)
            |||||||||||
Sbjct: 1426 ttttagttagg 1416

Score = 22.3 bits (11), Expect = 8.0
Identities = 11/11 (100%) Strand = Plus/Plus
Query: 167  cttcttggttt 177 (SEQ ID NO: 76)
            |||||||||||
Sbjct: 2698 cttcttggttt 2708

TABLE 3B
DNA files cs.maize.fasta (Query) and cs.rice.fasta (Subject), protein file cs.maize.prot:
Query = (4069 letters)
>cs.rice.fasta
Length = 8072

Score = 30.2 bits (15), Expect = 0.026 #1
Identities = 15/15 (100%), Strand = Plus/Plus
Query: 4034 catggatggtgaggt 4048 (SEQ ID NO: 77)
            |||||||||||||||
Sbjct: 5062 catggatggtgaggt 5076

Score = 26.3 bits (13), Expect = 0.41 #2
Identities = 13/13 (100%), Strand = Plus/Minus
Query: 3982 ttaattgatttta 3994 (SEQ ID NO: 78)
            |||||||||||||
Sbjct: 7611 ttaattgatttta 7599

Score = 26.3 bits (13), Expect = 0.41
Identities = 13/13 (100%), Strand = Plus/Plus
Query: 2687 ggccggcctacgt 2699 (SEQ ID NO: 79)
            |||||||||||||
Sbjct: 4783 ggccggcctacgt 4795

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 2291 agctagctagct 2302 (SEQ ID NO: 80)
            ||||||||||||
Sbjct: 1810 agctagctagct 1799

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Plus
Query: 2291 agctagctagct 2302 (SEQ ID NO: 81)
            ||||||||||||
Sbjct: 1799 agctagctagct 1810

Score = 24.3 bits (12), Expect = 1.6
Identities = 15/16 (93%), Strand = Plus/Minus
Query: 2716 acttgcatgcttgtag 2731 (SEQ ID NO: 82)
            ||||||||| ||||||
Sbjct: 1006 acttgcatgtttgtag 991 (SEQ ID NO: 83)

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 756  cgctcgctctac 767 (SEQ ID NO: 84)
            ||||||||||||
Sbjct: 1546 cgctcgctctac 1535

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 439  ctagctctcgct 450 (SEQ ID NO: 85)
            ||||||||||||
Sbjct: 1804 ctagctctcgct 1793

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 2287 agctagctagct 2298 (SEQ ID NO: 86)
            ||||||||||||
Sbjct: 1810 agctagctagct 1799

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 1404 aattaatatata 1415 (SEQ ID NO: 87)
            ||||||||||||
Sbjct: 5858 aattaatatata 5847

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 12   gttaacgtacat 23 (SEQ ID NO: 88)
            ||||||||||||
Sbjct: 1069 gttaacgtacat 1058

Score = 24.3 bits (12), Expect = 1.6
Identities = 12/12 (100%), Strand = Plus/Plus
Query: 2287 agctagctagct 2298 (SEQ ID NO: 89)
            ||||||||||||
Sbjct: 1799 agctagctagct 1810

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 330  gtccaactcac 340 (SEQ ID NO: 90)
            |||||||||||
Sbjct: 731  gtccaactcac 721

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 2218 catgcataaaa 2228 (SEQ ID NO: 91)
            |||||||||||
Sbjct: 3053 catgcataaaa 3063

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 4059 acctgtgtaag 4069 (SEQ ID NO: 92)
            |||||||||||
Sbjct: 5086 acctgtgtaag 5096

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 495  aggaagagaga 505 (SEQ ID NO: 93)
            |||||||||||
Sbjct: 6731 aggaagagaga 6721

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 2092 aaaaaatcaaa 2102 (SEQ ID NO: 94)
            |||||||||||
Sbjct: 2445 aaaaaatcaaa 2455

Score = 22.3 bits (11), Expect = 6.3
Identities = 14/15 (93%), Strand = Plus/Plus
Query: 1341 ttctaatatacgatt 1355 (SEQ ID NO: 95)
            |||||||||| ||||
Sbjct: 2933 ttctaatatatgatt 2947 (SEQ ID NO: 96)

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 1251 tatattatata 1261 (SEQ ID NO: 97)
            |||||||||||
Sbjct: 5850 tatattatata 5840

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 1710 gtataaattga 1720 (SEQ ID NO: 98)
            |||||||||||
Sbjct: 5720 gtataaattga 5710

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 2295 agctagctagc 2305 (SEQ ID NO: 99)
            |||||||||||
Sbjct: 1799 agctagctagc 1809

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 1910 ttttcaattat 1920 (SEQ ID NO: 100)
            |||||||||||
Sbjct: 7170 ttttcaattat 7180

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 2295 agctagctagc 2305 (SEQ ID NO: 101)
            |||||||||||
Sbjct: 1810 agctagctagc 1800

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 2598 aaatcttagaa 2608 (SEQ ID NO: 102)
            |||||||||||
Sbjct: 7832 aaatcttagaa 7822

Score = 22.3 bits (11), Expect = 6.3
Identities = 14/15 (93%), Strand = Plus/Minus
Query: 441  agctctcgcttgctc 455 (SEQ ID NO: 103)
            |||||||||| ||||
Sbjct: 1552 agctctcgctcgctc 1538 (SEQ ID NO: 104)

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Minus
Query: 2700 gctatatacta 2710 (SEQ ID NO: 105)
            |||||||||||
Sbjct: 2564 gctatatacta 2554

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 1356 tcatccataca 1366 (SEQ ID NO: 106)
            |||||||||||
Sbjct: 2689 tcatccataca 2699

Score = 22.3 bits (11), Expect = 6.3
Identities = 14/15 (93%), Strand = Plus/Minus
Query: 138  aaaaatatactacat 152 (SEQ ID NO: 107)
            ||||||||| |||||
Sbjct: 2432 aaaaatataatacat 2418 (SEQ ID NO: 108)

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 1174 agcaaattaaa 1184 (SEQ ID NO: 109)
            |||||||||||
Sbjct: 3026 agcaaattaaa 3036

Score = 22.3 bits (11), Expect = 6.3
Identities = 11/11 (100%), Strand = Plus/Plus
Query: 850  tgtttcttctt 860 (SEQ ID NO: 110)
            |||||||||||
Sbjct: 2608 tgtttcttctt 2618

TABLE 3C
DNA files sh2.maize.fasta (Query) and sh2.rice.small (Subject), protein file sh2.maize.prot:
Query = (7320 letters)
>sh2.rice.small
Length = 7500

Score = 34.2 bits (17), Expect = 0.003 #1
Identities = 26/29 (89%), Strand = Plus/Plus
Query: 935  gttgttgatgtctttacacagttcatctc 963 (SEQ ID NO: 111)
            |||||||||| ||| ||||| ||||||||
Sbjct: 752  gttgttgatggcttcacacaattcatctc 780 (SEQ ID NO: 112)

Score = 30.2 bits (15), Expect = 0.043 #2
Identities = 15/15 (100%), Strand = Plus/Plus
Query: 5210 tttgtaggttaacat 5224 (SEQ ID NO: 113)
            |||||||||||||||
Sbjct: 495  tttgtaggttaacat 509

Score = 26.3 bits (13), Expect = 0.68 #3
Identities = 22/25 (88%), Strand = Plus/Plus
Query: 1019 ccacaagatcacttcgggaggcaag 1043 (SEQ ID NO: 114)
            |||||||| ||||| || |||||||
Sbjct: 845  ccacaagaacacttgggcaggcaag 869 (SEQ ID NO: 115)

Score = 26.3 bits (13), Expect = 0.68 #4
Identities = 19/21 (90%), Strand = Plus/Plus
Query: 165  aaaggtttcatgtgtaccgtg 185 (SEQ ID NO: 116)
            |||||||||||| ||| ||||
Sbjct: 59   aaaggtttcatgcgtatcgtg 79 (SEQ ID NO: 117)

Score = 24.3 bits (12), Expect = 2.7 #5
Identities = 18/20 (90%), Strand = Plus/Minus
Query: 638  cttcattaaatgtgaatttc 657 (SEQ ID NO: 118)
            ||||||||||| ||| ||||
Sbjct: 6640 cttcattaaatatgattttc 6621 (SEQ ID NO: 119)

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Plus
Query: 408  ggcatctccacc 419 (SEQ ID NO: 120)
            ||||||||||||
Sbjct: 336  ggcatctccacc 347

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 980  tcatactctata 991 (SEQ ID NO: 121)
            ||||||||||||
Sbjct: 4739 tcatactctata 4728

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Plus
Query: 1281 ttctgttttttc 1292 (SEQ ID NO: 122)
            ||||||||||||
Sbjct: 1428 ttctgttttttc 1439

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Plus
Query: 983  tactctatataa 994 (SEQ ID NO: 123)
            ||||||||||||
Sbjct: 4853 tactctatataa 4864

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 7272 tattactaaatt 7283 (SEQ ID NO: 124)
            ||||||||||||
Sbjct: 1584 tattactaaatt 1573

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 5365 ttcattaatatc 5376 (SEQ ID NO: 125)
            ||||||||||||
Sbjct: 7105 ttcattaatatc 7094

Score = 24.3 bits (12), Expect = 2.7
Identities = 12/12 (100%), Strand = Plus/Minus
Query: 691  caaaattaatta 702 (SEQ ID NO: 126)
            ||||||||||||
Sbjct: 1473 caaaattaatta 1462

Score = 24.3 bits (12), Expect = 2.7
Identities = 15/16 (93%), Strand = Plus/Plus
Query: 5078 tttttcttgaatttgt 5093 (SEQ ID NO: 127)
            ||||| ||||||||||
Sbjct: 2239 tttttattgaatttgt 2254 (SEQ ID NO: 128)
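The identity figures reported in Tables 3A-3C can be re-derived from the aligned strings themselves. The following is a minimal Python sketch for checking them, not the software used to generate the tables. The function names `identity` and `reverse_complement` are illustrative assumptions, and integer division is an assumption chosen because it reproduces the truncated percentages shown in the tables (for example, 26/29 reported as 89%).

```python
def identity(query: str, subject: str) -> tuple[int, int, int]:
    """Return (matches, length, percent) for two ungapped aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(q == s for q, s in zip(query.lower(), subject.lower()))
    # Integer division truncates, matching the percentages listed in the tables.
    percent = 100 * matches // len(query)
    return matches, len(query), percent


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only a, c, g, t."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


# Hit #1 of Table 3C (SEQ ID NOS: 111 and 112)
print(identity("gttgttgatgtctttacacagttcatctc",
               "gttgttgatggcttcacacaattcatctc"))  # (26, 29, 89)
```

For Plus/Minus hits the subject coordinates run in descending order because the match lies on the complementary strand. A repeat such as agctagctagct in Table 3B is its own reverse complement, which is consistent with its being reported on both strands at the same subject coordinates.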

CNSs are Conserved by Sequence

A TTATTACATGPyG (SEQ ID NO: 1) sequence exists in the second intron of maize/rice lg3/osh6 and lg4a and b/osh71. This sequence is of special interest because a reverse one-hybrid screen showed that it bound the homeodomain of the LG3 homeodomain transcription factor in maize. We designed PCR primers to conserved exon 2 and exon 3 sequences with the aim of amplifying intron 2 from any grass.

The experiments were performed successfully. Amplified fragments were sequenced and piled up. The results are presented in FIGS. 3A-C. FIG. 3A shows piled-up sequences from various plant genes containing the smaller CNS sequence: TTATTACATGPyG (SEQ ID NO: 1). FIG. 3B is a lineup of CNSlg-l2 from representative grasses. FIG. 3C shows phylogenetic relationships between different plants. The asterisk (*) indicates the species in which useful conservation of CNS sequences has been demonstrated.

The results demonstrate that the only thing in common among these sequences, which come from diverse grass species, is the TTATTACATGPyG (SEQ ID NO: 1) consensus. The rest of intron 2 serves as the control for what noncoding sequence looks like when there is no evolutionary pressure.
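A search for the consensus in an amplified intron sequence can be sketched as follows. This is an illustrative Python example only, not the procedure used to produce FIG. 3A. It assumes that "Py" denotes a pyrimidine (C or T), written here with the IUPAC code Y, and the names `MOTIF`, `motif_regex` and `find_motif` are hypothetical.

```python
import re

# Consensus from SEQ ID NO: 1, with "Py" written as IUPAC Y (assumed: C or T).
MOTIF = "TTATTACATGYG"
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "R": "[AG]"}


def reverse_complement(seq: str) -> str:
    """Reverse complement; Y (C/T) pairs with R (A/G)."""
    return seq.translate(str.maketrans("ACGTYR", "TGCARY"))[::-1]


def motif_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif))


def find_motif(seq: str, motif: str = MOTIF) -> list[tuple[int, str]]:
    """Return sorted (0-based start, strand) hits on both strands of seq."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in motif_regex(motif).finditer(seq)]
    minus = motif_regex(reverse_complement(motif))
    hits += [(m.start(), "-") for m in minus.finditer(seq)]
    return sorted(hits)


print(find_motif("ggTTATTACATGCGaa"))  # [(2, '+')]
```

Scanning both strands matters because an intron sequence may be read from either strand, depending on which primer was used for sequencing.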

We specifically targeted this lg3 CNS to check for exact conservation of sequence in grasses. This particular CNS is too small to be used as a PCR primer, so amplifying and sequencing the surrounding intron was the only way to test for its conservation. We see no reason why this result should be unique or special.

FIG. 3C shows the most recent consensus on the evolutionary relationships among grasses. An asterisk denotes that we have evidence for the sort of CNS conservation that will permit CNS mapping. Our results suggest that every grass will be amenable to CNS mapping.

The size and frequency of a CNS region indicate the level of biological complexity of a biological process, tissue, organ, system, or other biological component. The CNS regions also provide information regarding the complexity of the organism itself.

Mapping in Nondomesticated (Wild) Species

FIG. 3 shows polymorphisms within species using lg1CNS3-exon1 primers, which amplify a promoter fragment from grasses.

Two different “wild” grass species are represented: Setaria glauca (five individuals from five different collections; numbers on figures refer to collections) and Setaria fabrii (six individuals from six different collections).

The promoter PCR fragment amplified is about the same size in all 11 individuals (FIG. 3A). However, when this PCR fragment is cut with a 4-cutter restriction enzyme (FIG. 3B), it is clear that the two species have different fragment patterns, and also that individuals within each species differ in pattern. Both Setaria glauca and Setaria fabrii are fertile species that each cross among themselves (but not with each other). Thus, grasslg1CNS3 makes a fine mapping primer for each of these two nondomesticated (wild) species. This gel clearly illustrates the utility of CNSs as primer sites for amplifying regions of genes that are expected to be polymorphic within a species.
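The logic of the gel comparison, in which amplicons of equal size give different fragment patterns after digestion, can be illustrated in silico. The Python sketch below is illustrative only. The source does not name the 4-cutter enzyme, so the recognition site GATC (the site cut by enzymes such as Sau3AI) is an assumption, and the toy sequences and the function names `digest` and `banding_pattern` are invented for the example.

```python
def digest(seq: str, site: str, cut_offset: int = 0) -> list[int]:
    """Fragment lengths after cutting a linear sequence at every site occurrence.

    cut_offset is the cut position relative to the start of the site
    (0 means the cut falls immediately before the site).
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    # Drop zero-length fragments that arise when a cut falls at either end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def banding_pattern(seq: str, site: str = "GATC") -> tuple[int, ...]:
    """Fragment sizes sorted largest first, as they would be read off a gel."""
    return tuple(sorted(digest(seq, site), reverse=True))


# Three hypothetical 20 bp amplicons of identical length.
individual_a = "AAAAGATCAAAAAAAAAAAA"
individual_b = "AAAAAAAAAAAAGATCAAAA"
individual_c = "AAAAGAACAAAAAAAAAAAA"  # site lost through a single base change

for name, amplicon in [("a", individual_a), ("b", individual_b), ("c", individual_c)]:
    print(name, len(amplicon), banding_pattern(amplicon))
```

Because the assumed site is palindromic, searching one strand is sufficient. Individuals whose patterns differ would be scored as polymorphic at that locus even though the undigested products are indistinguishable by size.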

REFERENCES

The following references are cited numerically in the Example.

-   1. Hardison, R. C. (2000) Trends in Genetics 16, 369-372.
-   2. Kellogg, E. A. (2001) Plant Physiology (Rockville) 125, 1198-1205.
-   3. Gale, M. D. & Devos, K. M. (1998) Proceedings of the National Academy of Sciences of the United States of America 95, 1971-1974.
-   4. Devos, K. M. & Gale, M. D. (2000) Plant Cell 12, 637-646.
-   5. Freeling, M. (2001) Plant Physiology (Rockville) 125, 1191-1197.
-   6. Paterson, A. H., Lin, Y.-R., Li, Z., Schertz, K. F., Doebley, J. F., Pinson, S. R. M., Liu, S.-C., Stansel, J. W. & Irvine, J. E. (1995) Science (Washington D.C.) 269, 1714-1718.
-   7. Tjian, R. (1995) Scientific American 272, 54-61.
-   8. Fickett, J. W. & Hatzigeorgiou, A. G. (1997) Genome Research 7, 861-878.
-   9. Bucher, P. (1999) Current Opinion in Structural Biology 9, 400-407.
-   10. Sumiyama, K., Kim, C.-B. & Ruddle, F. H. (2001) Genomics 71, 260-262.
-   11. Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller, W., Rubin, E. M. & Frazer, K. A. (2000) Science (Washington D.C.) 288, 136-140.
-   12. Gottgens, B., Gilbert, J. G. R., Barton, L. M., Grafham, D., Rogers, J., Bentley, D. R. & Green, A. R. (2001) Genome Research 11, 87-97.
-   13. Flint, J., Tufarelli, C., Peden, J., Clark, K., Daniels, R. J., Hardison, R., Miller, W., Philipsen, S., Tan-Un, K. C., McMorrow, T., Frampton, J., Alter, B. P., Frischauf, A.-M. & Higgs, D. R. (2001) Human Molecular Genetics 10, 371-382.
-   14. Stojanovic, N., Florea, L., Riemer, C., Gumucio, D., Slightom, J., Goodman, M., Miller, W. & Hardison, R. (1999) Nucleic Acids Research 27, 3899-3910.
-   15. Levy, S., Hannenhalli, S. & Workman, C. (2001) Bioinformatics (Oxford) 17, 871-877.
-   16. Dubchak, I., Brudno, M., Loots, G. G., Pachter, L., Mayor, C., Rubin, E. M. & Frazer, K. A. (2000) Genome Research 10, 1304-1306.
-   17. Mayor, C., Brudno, M., Schwartz, J. R., Poliakov, A., Rubin, E. M., Frazer, K. A., Pachter, L. S. & Dubchak, I. (2000) Bioinformatics (Oxford) 16, 1046-1047.
-   18. Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R. & Miller, W. (2000) Genome Research 10, 577-586.
-   19. Carter, R. J., Dubchak, I. & Holbrook, S. R. (2001) Nucleic Acids Research 29, 3928-3938.
-   20. Vicente-Carbajosa, J., Moose, S. P., Parsons, R. L. & Schmidt, R. J. (1997) Proceedings of the National Academy of Sciences of the United States of America 94, 7685-7690.
-   21. Forde, B., Heyworth, A., Pywell, J. & Kreis, M. (1985) Nucleic Acids Research 13, 7327-7339.
-   22. Altschul, S. F., Madden, T. L., Schaeffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Research 25, 3389-3402.
-   23. Tatusova, T. A. & Madden, T. L. (1999) FEMS Microbiology Letters 174, 247-250.
-   24. Becraft, P. & Freeling, M. (1991) Plant Cell 3, 801-808.
-   25. Sylvester, A. W., Cande, W. Z. & Freeling, M. (1990) Development (Cambridge) 110, 985-1000.
-   26. Becraft, P. W., Bongard-Pierce, D. K., Sylvester, A. W., Poethig, R. S. & Freeling, M. (1990) Developmental Biology 141, 220-232.
-   27. Moreno, M. A., Harper, L. C., Krueger, R. W., Dellaporta, S. L. & Freeling, M. (1997) Genes & Development 11, 616-628.
-   28. Clayton, W. D. & Renvoize, S. A. (1986) Genera graminum: grasses of the world (H.M.S.O., London).
-   29. Grass Phylogeny Working Group at www.ftg.flu.edu/grass/gpwg
-   30. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Research 22, 4673-4680.
-   31. Ansari-Lari, M. A., Oeltjen, J. C., Schwartz, S., Zhang, Z., Muzny, D. M., Lu, J., Gorrell, J. H., Chinault, A. C., Belmont, J. W., Miller, W. & Gibbs, R. A. (1998) Genome Research 8, 29-40.
-   32. Gaut, B. S., Morton, B. R., McCaig, B. C. & Clegg, M. T. (1996) Proceedings of the National Academy of Sciences of the United States of America 93, 10274-10279.
-   33. Freeman, S. & Herron, J. C. (2001) Evolutionary analysis (Prentice Hall, Upper Saddle River, N.J.).
-   34. Li, W.-H. & Graur, D. (1991) Fundamentals of molecular evolution (Sinauer Associates, Sunderland, Mass.).
-   35. Hardison, R. C., Oeltjen, J. & Miller, W. (1997) Genome Research 7, 959-966.
-   36. Hogenesch, J. B., Ching, K. A., Batalov, S., Su, A. I., Walker, J. R., Zhou, Y., Kay, S. A., Schultz, P. G. & Cooke, M. P. (2001) Cell 106, 413-415.
-   37. Messing, J. & Llaca, V. (1998) Proceedings of the National Academy of Sciences of the United States of America 95, 2017-2020.

The following additional references are cited throughout the specification.

-   Altschul, S. F., Madden, T. L., Schaeffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
-   Azuaje, F. (2002). A cluster validity framework for genome expression data. Bioinformatics (Oxford) 18, 319-320.
-   Birnbaum, K., Benfey, P. N. and Shasha, D. E. (2001). cis element/transcription factor analysis (cis/TF): A method for discovering transcription factor/cis element relationships. Genome Research 11, 1567-1573.
-   Bussemaker, H. J., Li, H. and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Nature Genetics 27, 167-171.
-   Davidson, E. H., Rast, J. P., Oliveri, P., Ransick, A., Calestani, C., Yuh, C.-H., Minokawa, T., Amore, G., Hinman, V., Arenas-Mena, C. et al. (2002). A genomic regulatory network for development. Science (Washington D.C.) 295, 1669-1678.
-   Dubchak, I., Brudno, M., Loots, G. G., Pachter, L., Mayor, C., Rubin, E. M. and Frazer, K. A. (2000). Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10, 1304-1306.
-   Eddy, S. R. (2002). Computational genomics of noncoding RNA genes. Cell 109, 137-140.
-   Ghosh, D. (2000). Object-oriented Transcription Factors Database (ooTFD). Nucleic Acids Research 28, 308-310.
-   Harmer, S. L., Hogenesch, J. B., Straume, M., Chang, H.-S., Han, B., Zhu, T., Wang, X., Kreps, J. A. and Kay, S. A. (2000). Orchestrated transcription of key pathways in Arabidopsis by the circadian clock. Science (Washington D.C.) 290, 2110-2113.
-   Heinekamp, T., Kuhlmann, M., Lenk, A., Strathmann, A. and Droege-Laser, W. (2002). The tobacco bZIP transcription factor BZI-1 binds to G-box elements in the promoters of phenylpropanoid pathway genes in vitro, but it is not involved in their regulation in vivo. MGG Molecular Genetics and Genomics 267, 16-26.
-   Heinemeyer, T., Chen, X., Karas, H., Kel, A. E., Kel, O. V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F. and Wingender, E. (1999). Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucleic Acids Research 27, 318-322.
-   Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A. E., Kel, O. V., Ignatieva, E. V., Ananko, E. A., Podkolodnaya, O. A., Kolpakov, F. A. et al. (1998). Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Research 26, 362-367.
-   Higo, K., Ugawa, Y., Iwamoto, M. and Korenaga, T. (1999). Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Research 27, 297-300.
-   Honda S.-i., Matsumoto, T. and Harada, N. (2001). Characterization and purification of a protein binding to the cis-acting element for brain-specific exon 1 of the mouse aromatase gene. Journal of Steroid Biochemistry and Molecular Biology 79, 255-260.
-   Kaplinsky, N. J., Braun, D. M., Penterman, J., Goff, S. A. and Freeling, M. (2002). Utility and distribution of conserved noncoding sequences in the grasses. Proc. Natl. Acad. Sci. USA 99, 6147-6151.
-   Kielbasa, S. M., Korbel, J. O., Beule, D., Schuchhardt, J. and Herzel, H. (2001). Combining frequency and positional information to predict transcription factor binding sites. Bioinformatics (Oxford) 17, 1019-1026.
-   Lau, N. C., Lim, L. P., Weinstein, E. G. and Bartel, D. P. (2001). An Abundant Class of Tiny RNAs with Probable Regulatory Roles in Caenorhabditis elegans. Science 294, 858-862.
-   Lee, R. C. and Ambros, V. (2001). An Extensive Class of Small RNAs in Caenorhabditis elegans. Science 294, 862-864.
-   Lescot, M., Dehais, P., Thijs, G., Marchal, K., Moreau, Y., Van de Peer, Y., Rouze, P. and Rombauts, S. (2002). PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Research 30, 325-327.
-   Levy, S., Hannenhalli, S. and Workman, C. (2001). Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics (Oxford) 17, 871-877.
-   Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller, W., Rubin, E. M. and Frazer, K. A. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science (Washington D.C.) 288, 136-140.
-   Markstein, M., Markstein, P., Markstein, V. and Levine, M. S. (2002). Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proceedings of the National Academy of Sciences of the United States of America 99, 763-768.
-   Mayor, C., Brudno, M., Schwartz, J. R., Poliakov, A., Rubin, E. M., Frazer, K. A., Pachter, L. S. and Dubchak, I. (2000). VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics (Oxford) 16, 1046-1047.
-   Storz, G. (2002). An Expanding Universe of Noncoding RNAs. Science 296, 1260-1263.
-   Sumiyama, K., Kim, C.-B. and Ruddle, F. H. (2001). An efficient cis-element discovery method using multiple sequence comparisons based on evolutionary relationships. Genomics 71, 260-262.
-   Toh, H. and Horimoto, K. (2002). Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics (Oxford) 18, 287-297.
-   Wingender, E., Kel, A. E., Kel, O. V., Karas, H., Heinemeyer, T., Dietze, P., Knueppel, R., Romaschenko, A. G. and Kolchanov, N. A. (1997). TRANSFAC, TRRD and COMPEL: Towards a federated database system on transcriptional regulation. Nucleic Acids Research 25, 265-268.
-   Wyrick, J. J. and Young, R. A. (2002). Deciphering gene expression regulatory networks. Current Opinion in Genetics & Development 12, 130-136.
-   Xu, Y., Olman, V. and Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics (Oxford) 18, 536-545.
-   Zhu, Z., Pilpel, Y. and Church, G. M. (2002). Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. Journal of Molecular Biology 318, 71-81.

1. A method of identifying a polymorphism between two or more DNA sequences; comprising: a) selecting a first and a second primer for use in an amplification reaction for each DNA sequence wherein said first or second primer is a CNS in said two or more DNA sequences; b) hybridizing said first and second primers to each DNA sequence to form primer-DNA hybrids; c) amplifying said primer-DNA hybrids to form amplified DNA products; and d) separating said DNA products by size to identify a polymorphism between said two or more DNA sequences.
 2. The method of claim 1 wherein said DNA sequences are plant DNA sequences.
 3. The method of claim 1 wherein said polymorphism is utilized for mapping.
 4. The method of claim 1 wherein said first primer is a CNS and said second primer is an exon sequence.
 5. The method of claim 1 wherein both the first and second primers are CNSs.
 6. The method of claim 1 wherein said DNA products are digested with a restriction endonuclease prior to separation in step d).
 7. The method of claim 1 wherein said DNA products are separated by electrophoresis.
 8. The method of claim 1 wherein said primers are PCR primers.
 9. The method of claim 1 wherein the primer-DNA hybrids are amplified using an amplification technique selected from the group consisting of cloning, PCR, LCR, TAS, 3SR, NASBA and Qβ amplification.
 10. The method of claim 9 wherein said technique is PCR.
 11. A method of mapping a polymorphism between two or more species; comprising: a) selecting a first and a second primer for use in an amplification reaction wherein said first or second primer is a CNS in the genomes of said two or more species wherein said genomes include genomic DNA; b) hybridizing said first and second primers to said genomic DNA sequences to form primer-genomic DNA hybrids; c) amplifying said primer-genomic DNA hybrids to form amplified DNA products; and d) separating said DNA products by size to map a polymorphism between said two or more DNA sequences.
 12. The method of claim 11 wherein said DNA sequences are plant DNA sequences.
 13. The method of claim 11 wherein said plant DNA is grass DNA.
 14. The method of claim 11 wherein said first primer is a CNS and said second primer is an exon sequence.
 15. The method of claim 11 wherein both the first and second primers are CNSs.
 16. The method of claim 11 wherein said DNA products are digested with a restriction endonuclease prior to separation in step d).
 17. The method of claim 11 wherein said DNA products are separated by electrophoresis.
 18. The method of claim 11 wherein said primers are PCR primers.
 19. The method of claim 1 wherein the primer-DNA hybrids are amplified using an amplification technique selected from the group consisting of cloning, PCR, LCR, TAS, 3SR, NASBA and Qβ amplification.
 20. The method of claim 19 wherein said technique is PCR. 